Walter Gautschi


Numerical Analysis


Second Edition


Birkhäuser


Walter Gautschi 

Department of Computer Sciences 
Purdue University 

250 N. University Street 

West Lafayette, IN 47907-2066 
wgautschi@purdue.edu


ISBN 978-0-8176-8258-3 e-ISBN 978-0-8176-8259-0 
DOI 10.1007/978-0-8176-8259-0 
Springer New York Dordrecht Heidelberg London 


Library of Congress Control Number: 2011941359 


Mathematics Subject Classification (2010): 65-01, 65D05, 65D07, 65D10, 65D25, 65D30, 65D32, 
65H04, 65H05, 65H10, 65L04, 65L05, 65L06, 65L10


© Springer Science+Business Media, LLC 1997, 2012 

All rights reserved. This work may not be translated or copied in whole or in part without the written 
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, 
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in 
connection with any form of information storage and retrieval, electronic adaptation, computer software, 
or by similar or dissimilar methodology now known or hereafter developed is forbidden. 

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are 
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject 
to proprietary rights. 


Printed on acid-free paper 


www.birkhauser-science.com 


TO ERIKA


Preface to the Second Edition 


In this second edition, the outline of chapters and sections has been preserved. The 
subtitle “An Introduction”, as suggested by several reviewers, has been deleted. The 
content, however, is brought up to date, both in the text and in the notes. Many 
passages in the text have been either corrected or improved. Some biographical 
notes have been added as well as a few exercises and computer assignments. The 
typographical appearance has also been improved by printing vectors and matrices 
consistently in boldface types. 

With regard to computer language in illustrations and exercises, we now adopt 
uniformly Matlab. For readers not familiar with Matlab, there are a number of 
introductory texts available, some, like Moler [2004], Otto and Denier [2005], 
Stanoyevitch [2005] that combine Matlab with numerical computing, others, like 
Knight [2000], Higham and Higham [2005], Hunt, Lipsman and Rosenberg [2006], 
and Driscoll [2009], more exclusively focused on Matlab. 

The major novelty, however, is a complete set of detailed solutions to all exercises 
and machine assignments. The solution manual is available to instructors upon
request at the publisher’s website http://www.birkhauser-science.com/978-0-8176-8258-3.
Selected solutions are also included in the text to give students an idea of
what is expected. The bibliography has been expanded to reflect technical advances 
in the field and to include references to new books and expository accounts. As a 
result, the text has undergone an expansion in size of about 20%. 


West Lafayette, Indiana Walter Gautschi 
November 2011 


Preface to the First Edition 


The book is designed for use in a graduate program in Numerical Analysis that 
is structured so as to include a basic introductory course and subsequent more 
specialized courses. The latter are envisaged to cover such topics as numerical 
linear algebra, the numerical solution of ordinary and partial differential equations, 
and perhaps additional topics related to complex analysis, to multidimensional 
analysis, in particular optimization, and to functional analysis and related functional 
equations. Viewed in this context, the first four chapters of our book could serve as 
a text for the basic introductory course, and the remaining three chapters (which 
indeed are at a distinctly higher level) could provide a text for an advanced course 
on the numerical solution of ordinary differential equations. In a sense, therefore, 
the book breaks with tradition in that it no longer attempts to deal with all
major topics of numerical mathematics. It is felt by the author that some of the 
current subdisciplines, particularly those dealing with linear algebra and partial 
differential equations, have developed into major fields of study that have attained 
a degree of autonomy and identity that justifies their treatment in separate books 
and separate courses on the graduate level. The term “Numerical Analysis” as 
used in this book, therefore, is to be taken in the narrow sense of the numerical 
analogue of Mathematical Analysis, comprising such topics as machine arithmetic, 
the approximation of functions, approximate differentiation and integration, and the 
approximate solution of nonlinear equations and of ordinary differential equations. 

What is being covered, on the other hand, is done so with a view toward 
stressing basic principles and maintaining simplicity and student-friendliness as far 
as possible. In this sense, the book is “An Introduction”. Topics that, even though 
important and of current interest, require a level of technicality that transcends the 
bounds of simplicity striven for, are referenced in detailed bibliographic notes at the 
end of each chapter. It is hoped, in this way, to place the material treated in proper 
context and to help, indeed encourage, the reader to pursue advanced modern topics 
in more depth. 

A significant feature of the book is the large collection of exercises that 
are designed to help the student develop problem-solving skills and to provide 
interesting extensions of topics treated in the text. Particular attention is given to 
machine assignments, where the student is encouraged to implement numerical
techniques on the computer and to make use of modern software packages. 

The author has taught the basic introductory course and the advanced course on 
ordinary differential equations regularly at Purdue University for the last 30 years 
or so. The former, typically, was offered both in the fall and spring semesters, to a 
mixed audience consisting of graduate (and some good undergraduate) students in 
mathematics, computer science, and engineering, while the latter was taught only in 
the fall, to a smaller but also mixed audience. Written notes began to materialize in 
the 1970s, when the author taught the basic course repeatedly in summer courses on 
Mathematics held in Perugia, Italy. Indeed, for some time, these notes existed only 
in the Italian language. Over the years, they were progressively expanded, updated, 
and transposed into English, and along with that, notes for the advanced course were 
developed. This, briefly, is how the present book evolved. 

A long gestation period such as this, of course, is not without dangers, the 
most notable one being a tendency for the material to become dated. The author 
tried to counteract this by constantly updating and revising the notes, adding newer 
developments when deemed appropriate. There are, however, benefits as well: over 
time, one develops a sense for what is likely to stand the test of time and what 
may only be of temporary interest, and one selects and deletes accordingly. Another 
benefit is the steady accumulation of exercises and the opportunity to have them 
tested on a large and diverse student population. 

The purpose of academic teaching, in the author’s view, is twofold: to transmit 
knowledge, and, perhaps more important, to kindle interest and even enthusiasm 
in the student. Accordingly, the author did not strive for comprehensiveness — 
even within the boundaries delineated — but rather tried to concentrate on what is 
essential, interesting and intellectually pleasing, and teachable. In line with this, 
an attempt has been made to keep the text uncluttered with numerical examples and 
other illustrative material. Being well aware, however, that mastery of a subject does 
not come from studying alone but from active participation, the author provided 
many exercises, including machine projects. Attributions of results to specific 
authors and citations to the literature have been deliberately omitted from the body 
of the text. Each chapter, as already mentioned, has a set of appended notes that 
help the reader to pursue related topics in more depth and to consult the specialized 
literature. It is here where attributions and historical remarks are made, and where 
citations to the literature — both textbook and research — appear. 

The main text is preceded by a prologue, which is intended to place the book in 
proper perspective. In addition to other textbooks on the subject, and information 
on software, it gives a detailed list of topics not treated in this book, but definitely 
belonging to the vast area of computational mathematics, and it provides ample 
references to relevant texts. A list of numerical analysis journals is also included. 

The reader is expected to have a good background in calculus and advanced 
calculus. Some passages of the text require a modest degree of acquaintance with 
linear algebra, complex analysis, or differential equations. These passages, however, 
can easily be skipped, without loss of continuity, by a student who is not familiar 
with these subjects. 

It is a pleasure to thank the publisher for showing interest in this book and
cooperating in producing it. The author is also grateful to Soren Jensen and Manil 
Suri, who taught from this text, and to an anonymous reader; they all made many 
helpful suggestions on improving the presentation. He is particularly indebted to 
Prof. Jensen for substantially helping in preparing the exercises to Chap. 7. The 
author further acknowledges assistance from Carl de Boor in preparing the notes to 
Chap. 2 and to Werner C. Rheinboldt for helping with the notes to Chap. 4. Last but 
not least, he owes a measure of gratitude to Connie Wilson for typing a preliminary 
version of the text and to Adam Hammer for assisting the author with the more 
intricate aspects of LaTeX. 


West Lafayette, Indiana Walter Gautschi 
January 1997 


Prologue


P1 Overview 


Numerical Analysis is the branch of mathematics that provides tools and methods 
for solving mathematical problems in numerical form. The objective is to develop 
detailed computational procedures, capable of being implemented on electronic 
computers, and to study their performance characteristics. Related fields are
Scientific Computation, which explores the application of numerical techniques and
computer architectures to concrete problems arising in the sciences and engineering; 
Complexity Theory, which analyzes the number of “operations” and the amount of 
computer memory required to solve a problem; and Parallel Computation, which 
is concerned with organizing computational procedures in a manner that allows 
running various parts of the procedures simultaneously on different processors. 

The problems dealt with in computational mathematics come from virtually 
all branches of pure and applied mathematics. There are computational aspects 
in number theory, combinatorics, abstract algebra, linear algebra, approximation 
theory, geometry, statistics, optimization, complex analysis, nonlinear equations, 
differential and other functional equations, and so on. It is clearly impossible 
to deal with all these topics in a single text of reasonable size. Indeed, the 
tendency today is to develop specialized texts dealing with one or the other 
of these topics. In the present text we concentrate on subject matters that are 
basic to problems in approximation theory, nonlinear equations, and differential 
equations. Accordingly, we have chapters on machine arithmetic, approximation 
and interpolation, numerical differentiation and integration, nonlinear equations, 
one-step and multistep methods for ordinary differential equations, and boundary 
value problems in ordinary differential equations. Important topics not covered 
in this text are computational number theory, algebra, and geometry; constructive 
methods in optimization and complex analysis; numerical linear algebra; and the 
numerical solution of problems involving partial differential equations and integral 
equations. Selected texts for these areas are enumerated in Sect. P3. 

We now describe briefly the topics treated in this text. Chapter 1 deals with
the basic facts of life regarding machine computation. It recognizes that, although 
present-day computers are extremely powerful in terms of computational speed, 
reliability, and amount of memory available, they are less than ideal — unless 
supplemented by appropriate software — when it comes to the precision available, 
and accuracy attainable, in the execution of elementary arithmetic operations. This 
raises serious questions as to how arithmetic errors, either present in the input 
data of a problem or committed during the execution of a solution algorithm, 
affect the accuracy of the desired results. Concepts and tools required to answer 
such questions are put forward in this introductory chapter. In Chap. 2, the central 
theme is the approximation of functions by simpler functions, typically polynomials 
and piecewise polynomial functions. Approximation in the sense of least squares 
provides an opportunity to introduce orthogonal polynomials, which are relevant 
also in connection with problems of numerical integration treated in Chap. 3. A large 
part of the chapter, however, deals with polynomial interpolation and associated 
error estimates, which are basic to many numerical procedures for integrating 
functions and differential equations. Also discussed briefly is inverse interpolation, 
an idea useful in solving equations. 
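
The question raised above, of how errors committed in elementary arithmetic operations affect the accuracy of a result, can be seen in a small experiment. The following is only an illustrative sketch in Python; the function names and the test value x = 1e-9 are our own choices for the illustration and are not taken from the chapters that follow. It evaluates 1 - cos x once directly, which subtracts two nearly equal numbers, and once through the mathematically equivalent expression 2 sin^2(x/2), which avoids the subtraction.

```python
import math

def one_minus_cos_naive(x):
    # Direct evaluation: subtracts two nearly equal numbers when x is small.
    return 1.0 - math.cos(x)

def one_minus_cos_stable(x):
    # Equivalent form 2*sin(x/2)**2, which involves no subtraction.
    s = math.sin(0.5 * x)
    return 2.0 * s * s

x = 1.0e-9
# Leading Taylor term x**2/2; the next term, x**4/24, is negligible for this x.
reference = 0.5 * x * x

rel_err_naive = abs(one_minus_cos_naive(x) - reference) / reference
rel_err_stable = abs(one_minus_cos_stable(x) - reference) / reference
print("naive :", rel_err_naive)
print("stable:", rel_err_stable)
```

In double precision one should expect the direct form to lose essentially all significant digits for this x, while the rewritten form remains accurate to about machine precision.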

First applications of interpolation theory are given in Chap. 3, where the tasks 
presented are the computation of derivatives and definite integrals. Although the 
formulae developed for derivatives are subject to the detrimental effects of machine 
arithmetic, they are useful, nevertheless, for purposes of discretizing differential 
operators. The treatment of numerical integration includes routine procedures, such 
as the trapezoidal and Simpson’s rules, appropriate for well-behaved integrands, as 
well as the more sophisticated procedures based on Gaussian quadrature to deal 
with singularities. It is here where orthogonal polynomials reappear. The method of 
undetermined coefficients is another technique for developing integration formulae. 
It is applied to approximate general linear functionals, the Peano representation 
of linear functionals providing an important tool for estimating the error. The 
chapter ends with a discussion of extrapolation techniques; although applicable to 
more general problems, they are inserted here since the composite trapezoidal rule 
together with the Euler-Maclaurin formula provides the best-known application:
Romberg integration.
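
As a rough indication of how extrapolation acts on the composite trapezoidal rule, the following Python sketch builds a small Romberg table. It is an illustration under our own assumptions (the function name `romberg`, the parameter `levels`, and the test integrand exp on [0, 1] are all hypothetical choices), not the formulation developed in Chap. 3. Column 0 holds trapezoidal values with 1, 2, 4, ... subintervals; each further column removes the next even power of the step size from the error.

```python
import math

def romberg(f, a, b, levels=5):
    """Return the Romberg table R; R[k][0] is the trapezoidal rule with 2**k panels."""
    R = [[0.0] * (k + 1) for k in range(levels)]
    h = b - a
    R[0][0] = 0.5 * h * (f(a) + f(b))
    for k in range(1, levels):
        h *= 0.5
        # Halving the step only requires f at the new midpoints.
        new_points = sum(f(a + (2 * i - 1) * h) for i in range(1, 2 ** (k - 1) + 1))
        R[k][0] = 0.5 * R[k - 1][0] + h * new_points
        # Richardson extrapolation: column j eliminates the h**(2*j) error term.
        for j in range(1, k + 1):
            R[k][j] = R[k][j - 1] + (R[k][j - 1] - R[k - 1][j - 1]) / (4 ** j - 1)
    return R

table = romberg(math.exp, 0.0, 1.0, levels=5)
exact = math.e - 1.0
err_trapezoid = abs(table[4][0] - exact)   # 16 panels, no extrapolation
err_romberg = abs(table[4][4] - exact)     # same function values, extrapolated
print(err_trapezoid, err_romberg)
```

Both errors are computed from the same 17 function values; the extrapolated entry is typically many orders of magnitude more accurate for a smooth integrand such as this one.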

Chapter 4 deals with iterative methods for solving nonlinear equations and 
systems thereof, the pièce de résistance being Newton’s method. The emphasis here
lies in the study of, and the tools necessary to analyze, convergence. The special 
case of algebraic equations is also briefly given attention. 
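
For orientation, Newton's method for a single equation f(x) = 0 can be sketched in a few lines of Python. This is a minimal illustration and not the treatment given in Chap. 4; the function name `newton`, the stopping rule based on the size of the last step, and the defaults `tol` and `max_iter` are assumptions made for the example.

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x - f(x)/f'(x) until the step is small relative to x."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            return x
    raise RuntimeError("Newton iteration did not converge")

# Example: the positive root of x**2 - 2 = 0, starting from x0 = 1.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
print(root)
```

Starting from x0 = 1, the iterates 1.5, 1.41667, 1.4142157, ... roughly double their number of correct digits at each step, which is the quadratic convergence whose analysis is the main concern of the chapter.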

Chapter 5 is the first of three chapters devoted to the numerical solution of 
ordinary differential equations. It concerns itself with one-step methods for solving 
initial value problems, such as the Runge-Kutta method, and gives a detailed 
analysis of local and global errors. Also included is a brief introduction to stiff 
equations and special methods to deal with them. Multistep methods and, in 
particular, Dahlquist’s theory of stability and its applications, are the subject of
Chap. 6. The final chapter (Chap. 7) is devoted to boundary value problems and their 
solution by shooting methods, finite difference techniques, and variational methods. 


P2 Numerical Analysis Software


There are many software packages available, both in the public domain and
distributed commercially, that deal with numerical analysis algorithms. A widely used
source of numerical software is Netlib, accessible at http://www.netlib.org.

Large collections of general-purpose numerical algorithms are contained in 
sources such as Slatec (http://www.netlib.org/slatec) and TOMS 
(ACM Transactions on Mathematical Software). Specialized packages relevant 
to the topics in the chapters ahead are identified in the “Notes” to each chapter. 
Likewise, specific files needed to do some of the machine assignments in the 
Exercises are identified as part of the exercise. 

Among the commercial software packages we mention the Visual Numerics 
(formerly IMSL) and NAG libraries. Interactive systems include HiQ, Macsyma, 
Maple, Mathcad, Mathematica, and Matlab. Many of these packages, in addition 
to numerical computation, have symbolic computation and graphics capabilities. 
Further information is available in the Netlib file commercial. For more libraries, 
and for interactive systems, also see Lozier and Olver [1994, Sect. 3]. 

In this text we consistently use Matlab as a vehicle for describing algorithms 
and as the software tool for carrying out some of the exercises and all machine 
assignments. 


P3 Textbooks and Monographs 


We provide here an annotated list (ordered alphabetically with respect to authors) 
of other textbooks on numerical analysis, written at about the same, or higher, level 
as the present one. Following this, we also mention books and monographs dealing 
with topics in computational mathematics not covered in our (and many other) books 
on numerical analysis. Additional books dealing with specialized subject areas, as 
well as other literature, are referenced in the “Notes” to the individual chapters. We 
generally restrict ourselves to books written in English and, with a few exceptions, 
published within the last 25 years or so. Even so, we have had to be selective. (No 
value judgment is to be implied by our selections or omissions.) A reader with access 
to the AMS (American Mathematical Society) MathSciNet homepage will have no
difficulty in retrieving a more complete list of relevant items, including older texts. 


P3.1 Selected Textbooks on Numerical Analysis 


Atkinson [1989] A comprehensive in-depth treatment of standard topics short of 
partial differential equations; includes an appendix describing some of the
better-known software packages.

Atkinson and Han [2009] An advanced text on theoretical (as opposed to
computational) aspects of numerical analysis, making extensive use of functional
analysis. 

Bruce, Giblin, and Rippon [1990] A collection of interesting mathematical
problems, ranging from number theory and computer-aided design to differential
equations, that require the use of computers for their solution. 

Cheney and Kincaid [1994] Although an undergraduate text, it covers a broad 
area, has many examples from science and engineering as well as computer 
programs; there are many exercises, including machine assignments. 

Conte and de Boor [1980] A widely used text for upper-division undergraduate 
students; written for a broad audience, with algorithmic concerns in the
foreground; has Fortran subroutines for many algorithms discussed in the text.

Dahlquist and Björck [2003, 2008] The first (2003) text, a reprint of the 1974
classic, provides a comprehensive introduction to all major fields of numerical
analysis, striking a good balance between theoretical issues and more practical 
ones. The second text expands substantially on the more elementary topics 
treated in the first and represents the first volume of more to come. 

Deuflhard and Hohmann [2003] An introductory text with emphasis on machine 
computation and algorithms; includes discussions of three-term recurrence 
relations and stochastic eigenvalue problems (not usually found in textbooks), 
but no differential equations. 

Fröberg [1985] A thorough and exceptionally lucid exposition of all major topics
of numerical analysis exclusive of algorithms and computer programs. 

Hämmerlin and Hoffmann [1991] Similar to Stoer and Bulirsch [2002] in its
emphasis on mathematical theory; has more on approximation theory and 
multivariate interpolation and integration, but nothing on differential equations. 

Householder [2006] A reissue of one of the early mathematical texts on the 
subject, with coverage limited to systems of linear and nonlinear equations and 
topics in approximation. 

Isaacson and Keller [1994] One of the older but still eminently readable texts, 
stressing the mathematical analysis of numerical methods. 

Kincaid and Cheney [1996] Related to Cheney and Kincaid [1994] but more 
mathematically oriented and unusually rich in exercises and bibliographic items. 

Kress [1998] A rather comprehensive text with a strong functional analysis 
component. 

Neumaier [2001] A text emphasizing robust computation, including interval 
arithmetic. 

Rutishauser [1990] An annotated translation from the German of an older text 
based on posthumous notes by one of the pioneers of numerical analysis; 
although the subject matter reflects the state of the art in the early 1970s, the 
treatment is highly original and is supplemented by translator’s notes to each 
chapter pointing to more recent developments. 

Schwarz [1989] A mathematically oriented treatment of all major areas of
numerical analysis, including ordinary and partial differential equations.

Stoer and Bulirsch [2002] Fairly comprehensive in coverage; written in a style
appealing more to mathematicians than engineers and computer scientists; has 
many exercises and bibliographic references; serves not only as a textbook but 
also as a reference work. 

Todd [1979, 1977] Rather unique books, emphasizing problem-solving in areas 
often not covered in other books on numerical analysis. 


P3.2 Monographs and Books on Specialized Topics


A collection of outstanding survey papers on specialized topics in numerical 
analysis is being assembled by Ciarlet and Lions [1990-2003] in handbooks of 
numerical analysis; nine volumes have appeared so far. Another source of surveys 
on a variety of topics is Acta numerica, an annual series of books edited by Iserles 
[1992-2010], of which 19 volumes have been published so far. For an authoritative 
account of the history of numerical analysis from the 16th through the 19th century, 
the reader is referred to the book by Goldstine [1977]. For more recent history, see 
Bultheel and Cools, eds. [2010]. 

The related areas of Scientific Computing and Parallel Computing are rather 
more recent fields of study. Basic introductory texts are Scott et al. [2005] 
and Tveito and Winter [2009]. Texts relevant to linear algebra and differential 
equations are Schendel [1984], Ortega and Voigt [1985], Ortega [1989], Golub 
and Ortega [1992], [1993], Van de Velde [1994], Burrage [1995], Heath [1997], 
Deuflhard and Bornemann [2002], O’Leary [2009], and Quarteroni et al. [2010]. 
Other texts address topics in optimization, Pardalos et al. [1992] and Gonnet 
and Scholl [2009]; computational geometry, Akl and Lyons [1993]; and other 
miscellaneous areas, Crandall [1994], [1996], Köckler [1994], Bellomo and Preziosi
[1995], Danaila et al. [2007], and Farin and Hansford [2008]. Interesting historical 
essays are contained in Nash, ed. [1990]. Matters regarding the Complexity of 
numerical algorithms are discussed in an abstract framework in books by Traub and
Woźniakowski [1980] and Traub, Wasilkowski, and Woźniakowski [1983], [1988],
with applications to the numerical integration of functions and nonlinear equations, 
and similarly, applied to elliptic partial differential equations and integral equations, 
in the book by Werschulz [1991]. Other treatments are those by Kronsjö [1987], Ko
[1991], Bini and Pan [1994], Wang et al. [1994], Traub and Werschulz [1998], Ritter 
[2000], and Novak et al. [2009]. For an in-depth complexity analysis of Newton’s 
method, the reader is encouraged to study Smale’s [1987] lecture. 

Material on Computational Number Theory can be found, at the undergraduate 
level, in the book by Rosen [2000], which also contains applications to cryptography 
and computer science, and in Allenby and Redfern [1989], and at a more advanced 
level in the books by Niven et al. [1991], Cohen [1993], and Bach and Shallit 
[1996]. Computational methods of factorization are dealt with in the book by 
Riesel [1994]. Other useful sources are the set of lecture notes by Pohst [1993] 
on algebraic number theory algorithms, and the proceedings volumes edited by 
Pomerance [1990] and Gautschi [1994a, Part II]. For algorithms in Combinatorics,
see the books by Nijenhuis and Wilf [1978], Hu and Shing [2002], and Cormen et 
al. [2009]. Various aspects of Computer Algebra are treated in the books by Geddes 
et al. [1992], Mignotte [1992], Davenport et al. [1993], Mishra [1993], Heck [2003], 
and Cox et al. [2007]. 

Other relatively new disciplines are Computational Geometry and Geometric 
Modeling, Computer-Aided Design, and Computational Topology, for which
relevant texts are, respectively, Preparata and Shamos [1985], Edelsbrunner [1987],
Mäntylä [1988], Taylor [1992], McLeod and Baart [1998], Gallier [2000], Cohen et
al. [2001], and Salomon [2006]; Hoschek and Lasser [1993], Farin [1997], [1999],
and Prautzsch et al. [2002]; Edelsbrunner [2006], and Edelsbrunner and Harer [2010].
Statistical Computing is covered in general textbooks such as Kennedy and Gentle 
[1980], Anscombe [1981], Maindonald [1984], Thisted [1988], Monahan [2001], 
Gentle [2009], and Lange [2010]. More specialized texts are Devroye [1986] and 
Hörmann et al. [2004] on the generation of nonuniform random variables, Späth
[1992] on regression analysis, Heiberger [1989] on the design of experiments,
Stewart [1994] on Markov chains, Xiu [2010] on stochastic computing and
uncertainty quantification, and Fang and Wang [1994], Manno [1999], Gentle [2003],
Liu [2008], Shonkwiler and Mendivil [2009], and Lemieux [2009] on Monte Carlo 
and number-theoretic methods. Numerical techniques in Optimization (including 
optimal control problems) are discussed in Evtushenko [1985]. An introductory 
book on unconstrained optimization is Wolfe [1978]; among more advanced and 
broader texts on optimization techniques we mention Gill et al. [1981], Ciarlet 
[1989], and Fletcher [2001]. Linear programming is treated in Nazareth [1987] and 
Panik [1996], linear and quadratic problems in Sima [1996], and the application of 
conjugate direction methods to problems in optimization in Hestenes [1980]. The 
most comprehensive text on (numerical and applied) Complex Analysis is the
three-volume treatise by Henrici [1988, 1991, 1986]. Numerical methods for conformal
mapping are also treated in Kythe [1998], Schinzinger and Laura [2003], and 
Papamichael and Stylianopoulos [2010]. For approximation in the complex domain, 
the standard text is Gaier [1987]; Stenger [1993] deals with approximation by 
sinc functions, Stenger [2011] providing some 450 Matlab programs. The book by 
Iserles and Nørsett [1991] contains interesting discussions on the interface between
complex rational approximation and the stability theory of discretized differential 
equations. The impact of high-precision computation on problems and conjectures 
involving complex approximation is beautifully illustrated in the set of lectures by 
Varga [1990]. 

For an in-depth treatment of many of the preceding topics, also see the
four-volume work of Knuth [1975, 1981, 1973, 2005-2006].

Perhaps the most significant topic omitted in our book is numerical linear algebra 
and its application to solving partial differential equations by finite difference or 
finite element methods. Fortunately, there are many treatises available that address 
these areas. For Numerical Linear Algebra, we refer to the classic work of Wilkinson 
[1988] and the book by Golub and Van Loan [1996]. Links and applications 
of matrix computation to orthogonal polynomials and quadrature are the subject 
of Golub and Meurant [2010]. Other general texts are Jennings and McKeown
[1992], Watkins [2002], [2007], Demmel [1997], Trefethen and Bau [1997], Stewart 
[1973], [1998], Meurant [1999], White [2007], Allaire and Kaber [2008], and 
Datta [2010]; Higham [2002], [2008] has a comprehensive treatment of error and 
stability analyses and the first, equally extensive, treatment of the numerics of matrix 
functions. Solving linear systems on vector and shared memory parallel computers 
and the use of linear algebra packages on high-performance computers are discussed 
in Dongarra et al. [1991], [1998]. The solution of sparse linear systems and the 
special data structures and pivoting strategies required in direct methods are treated 
in Østerby and Zlatev [1983], Duff et al. [1989], Zlatev [1991], and Davis [2006],
whereas iterative techniques are discussed in the classic texts by Young [2003] 
and Varga [2000], and in Il’in [1992], Hackbusch [1994], Weiss [1996], Fischer 
[1996], Brezinski [1997], Greenbaum [1997], Saad [2003], Broyden and Vespucci 
[2004], Hageman and Young [2004], Meurant [2006], Chan and Jin [2007], Byrne 
[2008], and Woźnicki [2009]. The books by Branham [1990] and Björck [1996]
are devoted especially to least squares problems. For eigenvalues, see Chatelin 
[1983], [1993], and for a good introduction to the numerical analysis of symmetric 
eigenvalue problems, see Parlett [1998]. The currently very active investigation of 
large sparse symmetric and nonsymmetric eigenvalue problems and their solution 
by Lanczos-type methods has given rise to many books, for example, Cullum and 
Willoughby [1985], [2002], Meyer [1987], Sehmi [1989], and Saad [1992]. For 
structured and symplectic eigenvalue problems, see Fassbender [2000] and Kressner 
[2005], and for inverse eigenvalue problems, Xu [1998] and Chu and Golub [2005]. 
For readers wishing to test their algorithms on specific matrices, the collection of 
test matrices in Gregory and Karney [1978] and the “matrix market” on the Web 
(http://math.nist.gov/MatrixMarket) are useful sources.

Even more extensive is the textbook literature on the numerical solution of
Partial Differential Equations. The field has grown so much that there are currently only
a few books that attempt to cover the subject more or less as a whole. Among these 
are Birkhoff and Lynch [1984] (for elliptic problems), Hall and Porsching [1990], 
Ames [1992], Celia and Gray [1992], Larsson and Thomée [2003], Quarteroni and 
Valli [1994], Morton and Mayers [2005], Sewell [2005], Quarteroni [2009], and 
Tveito and Winter [2009]. Variational and finite element methods seem to have 
attracted the most attention. An early and still frequently cited reference is the 
book by Ciarlet [2002] (a reprint of the 1978 original); among more recent texts 
we mention Beltzer [1990] (using symbolic computation), Křížek and Neittaanmäki
[1990], Brezzi and Fortin [1991], Schwab [1998], Kwon and Bang [2000] (using
Matlab), Zienkiewicz and Taylor [2000], Axelsson and Barker [2001], Babuška
and Strouboulis [2001], Höllig [2003], Monk [2003] (for Maxwell’s equation),
Ern and Guermond [2004], Kythe and Wei [2004], Reddy [2004], Chen [2005],
Elman et al. [2005], Thomée [2006] (for parabolic equations), Braess [2007], 
Demkowicz [2007], Brenner and Scott [2008], Bochev and Gunzburger [2009], 
Efendiev and Hou [2009], and Johnson [2009]. Finite difference methods are treated 
in Ashyralyev and Sobolevskii [1994], Gustafsson et al. [1995], Thomas [1995], 
[1999], Samarskii [2001], Strikwerda [2004], LeVeque [2007], and Gustafsson 
[2008]; the method of lines in Schiesser [1991]; and the more refined techniques 
of multigrids and domain decomposition in McCormick [1989], [1992], Bramble 
[1993], Shaidurov [1995], Smith et al. [1996], Quarteroni and Valli [1999], Briggs 
et al. [2000], Toselli and Widlund [2005], and Mathew [2008]. Problems in potential 
theory and elasticity are often approached via boundary element methods, for which 
representative texts are Brebbia and Dominguez [1992], Chen and Zhou [1992], 
Hall [1994], and Steinbach [2008]. A discussion of conservation laws is given in the 
classic monograph by Lax [1973] and more recently in LeVeque [1992], Godlewski 
and Raviart [1996], Kröner [1997], and LeVeque [2002]. Spectral methods, i.e., 
expansions in (typically) orthogonal polynomials, applied to a variety of problems, 
were pioneered in the monograph by Gottlieb and Orszag [1977] and have received 
extensive treatments in more recent texts by Canuto et al. [1988], [2006], [2007], 
Fornberg [1996], Guo [1998], Trefethen [2000] (in Matlab), Boyd [2001], Peyret 
[2002], Hesthaven et al. [2007], and Kopriva [2009]. 

Early, but still relevant, texts on the numerical solution of Integral Equations are 
Atkinson [1976] and Baker [1977]. More recent treatises are Atkinson [1997] and 
Kythe and Puri [2002]. Volterra integral equations are dealt with by Brunner and van 
der Houwen [1986] and Brunner [2004], whereas singular integral equations are the 
subject of Prossdorf and Silbermann [1991]. 


P4 Journals 


Here we list the major journals (in alphabetical order) covering the areas of 
numerical analysis and mathematical software. 


ACM Transactions on Mathematical Software 

Applied Numerical Mathematics 

BIT Numerical Mathematics 

Calcolo 

Chinese Journal of Numerical Mathematics and Applications 
Computational Mathematics and Mathematical Physics 
Computing 

IMA Journal of Numerical Analysis 

Journal of Computational and Applied Mathematics 
Mathematical Modelling and Numerical Analysis 
Mathematics of Computation 

Numerical Algorithms 

Numerische Mathematik 

SIAM Journal on Numerical Analysis 

SIAM Journal on Scientific Computing 


Chapter 1 
Machine Arithmetic and Related Matters 


The questions addressed in this first chapter are fundamental in the sense that 
they are relevant in any situation that involves numerical machine computation, 
regardless of the kind of problem that gave rise to these computations. In the first 
place, one has to be aware of the rather primitive type of number system available 
on computers. It is basically a finite system of numbers of finite length, thus a far cry 
from the idealistic number system familiar to us from mathematical analysis. The 
passage from a real number to a machine number entails rounding, and thus small 
errors, called roundoff errors. Additional errors are introduced when the individual 
arithmetic operations are carried out on the computer. In themselves, these errors 
are harmless, but acting in concert and propagating through a lengthy computation, 
they can have significant — even disastrous — effects. 

Most problems involve input data not representable exactly on the computer. 
Therefore, even before the solution process starts, simply by storing the input in 
computer memory, the problem is already slightly perturbed, owing to the necessity 
of rounding the input. It is important, then, to estimate how such small perturbations 
in the input affect the output, the solution of the problem. This is the question of 
the (numerical) condition of a problem: the problem is called well conditioned if the 
changes in the solution of the problem are of the same order of magnitude as the 
perturbations in the input that caused those changes. If, on the other hand, they 
are much larger, the problem is called ill conditioned. It is desirable to measure by 
a single number — the condition number of the problem — the extent to which the 
solution is sensitive to perturbations in the input. The larger this number, the more 
ill conditioned the problem. 

Once the solution process starts, additional rounding errors will be committed, 
which also contaminate the solution. The resulting errors, in contrast to those 
caused by input errors, depend on the particular solution algorithm. It makes sense, 
therefore, to also talk about the condition of an algorithm, although its analysis is 
usually quite a bit harder. The quality of the computed solution is then determined 
by both (essentially the product of) the condition of the problem and the condition 
of the algorithm. 
1.1. Real Numbers, Machine Numbers, and Rounding 


We begin with the number system commonly used in mathematical analysis and 
confront it with the more primitive number system available to us on any particular 
computer. We identify the basic constant (the machine precision) that determines 
the level of precision attainable on such a computer. 


1.1.1 Real Numbers 


One can introduce real numbers in many different ways. Mathematicians favor 
the axiomatic approach, which leads them to define the set of real numbers as a 
“complete Archimedean ordered field.” Here we adopt a more pedestrian attitude 
and consider the set of real numbers R to consist of positive and negative numbers 
represented in some appropriate number system and manipulated in the usual 
manner known from elementary arithmetic. We adopt here the binary number 
system, since it is the one most commonly used on computers. Thus, 


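$$x \in \mathbb{R} \quad\text{iff}\quad x = \pm\left(b_n 2^n + b_{n-1} 2^{n-1} + \cdots + b_0 + b_{-1} 2^{-1} + b_{-2} 2^{-2} + \cdots\right). \tag{1.1}$$

Here $n \ge 0$ is some integer, and the “binary digits” $b_i$ are either 0 or 1,

$$b_i = 0 \ \text{ or } \ b_i = 1 \quad \text{for all } i. \tag{1.2}$$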


It is important to note that in general we need infinitely many binary digits to 
represent a real number. We conveniently write such a number in the abbreviated 
form (familiar from the decimal number system) 


$$x = \pm\,(b_n b_{n-1} \cdots b_0 \,.\, b_{-1} b_{-2} b_{-3} \cdots)_2, \tag{1.3}$$


where the subscript 2 at the end is to remind us that we are dealing with a binary 
number. (Without this subscript, the number could also be read as a decimal number, 
which would be a source of ambiguity.) The dot in (1.3) — appropriately called the 
binary point — separates the integer part on the left from the fractional part on the 
right. Note that representation (1.3) is not unique, for example, $(0.0111\ldots)_2 = (0.1)_2$. 
We regain uniqueness if we always insist on a finite representation, if one 
exists. 


Examples. 1. $(10011.01)_2 = 2^4 + 2^1 + 2^0 + 2^{-2} = 16 + 2 + 1 + \frac{1}{4} = (19.25)_{10}$.

2. $(0.010101\ldots)_2 = \sum_{k=2,\ k\ \mathrm{even}}^{\infty} 2^{-k} = \sum_{m=1}^{\infty} 2^{-2m} = \frac{1}{4}\sum_{m=0}^{\infty}\left(\frac{1}{4}\right)^{m} = \frac{1}{4}\cdot\frac{1}{1-\frac{1}{4}} = \frac{1}{3} = (0.333\ldots)_{10}$.

3. $\frac{1}{5} = (0.2)_{10} = (0.00110011\ldots)_2$.


To determine the binary digits on the right, one keeps multiplying by 2 and 
observing the integer part in the result; if it is zero, the binary digit in question 
is 0, otherwise 1. In the latter case, the integral part is removed and the process 
repeated. 
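
The digit-extraction rule just described can be sketched in a few lines of Python. The function name `binary_fraction_digits` is an illustrative choice, not part of the text, and exact rational arithmetic (`fractions.Fraction`) is assumed so that the repeated doubling introduces no rounding of its own.

```python
from fractions import Fraction

def binary_fraction_digits(x, n_digits):
    """Return the first n_digits binary digits of x, where 0 <= x < 1."""
    x = Fraction(x)
    digits = []
    for _ in range(n_digits):
        x *= 2                  # shift the binary point one place to the right
        if x >= 1:              # nonzero integer part: the digit is 1
            digits.append(1)
            x -= 1              # remove the integer part and continue
        else:                   # zero integer part: the digit is 0
            digits.append(0)
    return digits

print(binary_fraction_digits(Fraction(1, 5), 12))
print(binary_fraction_digits(Fraction(1, 4), 4))
```

For 1/5 the digit pattern 0011 repeats indefinitely, in agreement with Example 3, whereas for 1/4 the process terminates after two digits.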


The last example is of interest insofar as it shows that to a finite decimal number 
there may correspond a (nontrivial) infinite binary representation. One cannot 
assume, therefore, that a finite decimal number is exactly representable on a binary 
computer. Conversely, however, to a finite binary number there always corresponds 
a finite decimal representation. (Why?) 


1.1.2. Machine Numbers 


There are two kinds of machine numbers: floating point and fixed point. The first 
corresponds to the “scientific notation” in the decimal system, whereby a number is 
written as a decimal fraction times an integral power of 10. The second allows only 
for fractions. On a binary computer, one consistently uses powers of 2 instead of 10. 
More important, the number of binary digits, both in the fraction and in the exponent 
of 2 (if any), is finite and cannot exceed certain limits that are characteristics of the 
particular computer at hand. 


1.1.2.1 Floating-Point Numbers 


We denote by $t$ the number of binary digits allowed by the computer in the fractional 
part and by $s$ the number of binary digits in the exponent. Then the set of (real) 
floating-point numbers on that computer will be denoted by $\mathbb{R}(t, s)$. Thus, 

$$x \in \mathbb{R}(t,s) \quad\text{iff}\quad x = f \cdot 2^{e}, \tag{1.4}$$

where, in the notation of (1.3), 

$$f = \pm\,(\,.\,b_{-1} b_{-2} \cdots b_{-t})_2, \qquad e = \pm\,(c_{s-1} c_{s-2} \cdots c_0\,.\,)_2. \tag{1.5}$$


Here all $b_i$ and $c_j$ are binary digits, that is, either zero or one. The binary fraction 
$f$ is usually referred to as the mantissa of $x$ and the integer $e$ as the exponent of $x$. 
The number $x$ in (1.4) is said to be normalized if in its fraction $f$ we have $b_{-1} = 1$. 
We assume that all numbers in $\mathbb{R}(t, s)$ are normalized (with the exception of $x = 0$, 
which is treated as a special number). If $x \neq 0$ were not normalized, we could 
multiply $f$ by an appropriate power of 2, to normalize it, and adjust the exponent 
accordingly. This is always possible as long as the adjusted exponent is still in the 
admissible range. 

Fig. 1.1 Packing of a floating-point number in a machine register (a fraction field of $t$ bits and an exponent field of $s$ bits) 

We can think of a floating-point number (1.4) as being accommodated in a 
machine register as shown in Fig. 1.1. The figure does not quite correspond to reality, 
but is close enough to it for our purposes. 

Note that the set (1.4) of normalized floating-point numbers is finite and is thus 
represented by a finite set of points on the real line. What is worse, these points are 
not uniformly distributed (cf. Ex. 1). This, then, is all we have to work with! 

It is immediately clear from (1.4) and (1.5) that the largest and smallest 
magnitude of a (normalized) floating-point number is given, respectively, by 


$$\max_{x \in \mathbb{R}(t,s)} |x| = (1 - 2^{-t})\, 2^{2^{s}-1}, \qquad \min_{x \in \mathbb{R}(t,s)} |x| = 2^{-2^{s}}. \tag{1.6}$$

On a Sun Sparc workstation, for example, one has $t = 23$, $s = 7$, so that the 
maximum and minimum in (1.6) are $1.70 \times 10^{38}$ and $2.94 \times 10^{-39}$, respectively. 
(Because of an asymmetric internal hardware representation of the exponent on 
these computers, the true range of floating-point numbers is slightly shifted, more 
like from $1.18 \times 10^{-38}$ to $3.40 \times 10^{38}$.) Matlab arithmetic, essentially double 
precision, uses $t = 53$ and $s = 10$, which greatly expands the number range from 
something like $10^{-308}$ to $10^{+308}$. 
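
The two bounds in (1.6) are easy to evaluate numerically. The following minimal sketch assumes the idealized model R(t, s) of the text; the helper name `float_range` is hypothetical, and exact rationals are used so that the bounds themselves are not rounded before printing.

```python
from fractions import Fraction

def float_range(t, s):
    """Largest and smallest magnitudes of normalized numbers in R(t, s), per (1.6)."""
    largest = (1 - Fraction(1, 2**t)) * 2 ** (2**s - 1)
    smallest = Fraction(1, 2 ** (2**s))
    return largest, smallest

for t, s in [(23, 7), (53, 10)]:
    largest, smallest = float_range(t, s)
    print(f"t={t}, s={s}: max = {float(largest):.3e}, min = {float(smallest):.3e}")
```

For t = 23, s = 7 this reproduces the values quoted above. Actual hardware formats differ in detail from the model (for instance, in how the exponent is stored), so the printed values should be read as orders of magnitude only.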

A real nonzero number whose modulus is not in the range determined by (1.6) 
cannot be represented on this particular computer. If such a number is produced 
during the course of a computation, one says that overflow has occurred if its 
modulus is larger than the maximum in (1.6) and underflow if it is smaller than 
the minimum in (1.6). The occurrence of overflow is fatal, and the machine (or its 
operating system) usually prompts the computation to be interrupted. Underflow 
is less serious, and one may get away with replacing the delinquent number by 
zero. However, this is not foolproof. Imagine that at the next step the number that 
underflowed is to be multiplied by a huge number. If the replacement by zero has 
been made, the result will always be zero. 
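
The danger of replacing an underflowed number by zero can be observed directly. The sketch below assumes IEEE double-precision arithmetic, as provided by Python's built-in `float`; the particular values are chosen only for illustration.

```python
tiny = 1e-200
huge = 1e300

# tiny * tiny is about 1e-400, far below the smallest positive double,
# so the machine result is replaced by zero.
underflowed = tiny * tiny
naive = underflowed * huge          # the zero propagates: result is 0.0

# Reordering the same product keeps every intermediate result representable.
reordered = (tiny * huge) * tiny    # about 1e-100

print(naive, reordered)
```

Both expressions represent the same mathematical product; only the order of the multiplications differs.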

To increase the precision, one can use two machine registers to represent a 
machine number. In effect, one then embeds $\mathbb{R}(t,s) \subset \mathbb{R}(2t,s)$, and calls 
$x \in \mathbb{R}(2t, s)$ a double-precision number. 

Fig. 1.2 Packing of a fixed-point number in a machine register 


1.1.2.2 Fixed-Point Numbers 


This is the case (1.4) where $e = 0$. That is, fixed-point numbers are binary fractions, 
$x = f$, hence $|f| < 1$. We can therefore only deal with numbers that are in 
the interval $(-1, 1)$. This, in particular, requires extensive scaling and rescaling to 
make sure that all initial data, as well as all intermediate and final results, lie in that 
interval. Such a complication can only be justified in special circumstances where 
machine time and/or precision is at a premium. Note that on the same computer as 
considered before, we do not need to allocate space for the exponent in the machine 
register, and thus have in effect $s + t$ binary digits available for the fraction $f$, hence 
more precision; cf. Fig. 1.2. 


1.1.2.3. Other Data Structures for Numbers 


Complex floating-point numbers consist of pairs of real floating-point numbers, 
the first of the pair representing the real part and the second the imaginary part. 
To avoid rounding errors in arithmetic operations altogether, one can employ 
rational arithmetic, in which each (rational) number is represented by a pair 
of extended-precision integers — the numerator and denominator of the rational 
number. The Euclidean algorithm is used to remove common factors. A device 
that allows keeping track of error propagation and the influence of data errors is 
interval arithmetic involving intervals guaranteed to contain the desired numbers. In 
complex arithmetic, one employs rectangular or circular domains. 
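
As a small illustration of rational arithmetic, the sketch below uses Python's standard `fractions.Fraction` type, which stores a numerator and a denominator as integers of unlimited length and removes common factors automatically. It is meant only to contrast exact rational results with binary floating-point results, not as a recommendation for large-scale computation.

```python
from fractions import Fraction

# In binary floating point, 0.1, 0.2, and 0.3 are all rounded on input.
print(0.1 + 0.2 == 0.3)

# In rational arithmetic the same sum is exact.
total = Fraction(1, 10) + Fraction(2, 10)
print(total == Fraction(3, 10))

# Common factors are removed: 6/8 is stored as 3/4.
q = Fraction(6, 8)
print(q.numerator, q.denominator)
```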


1.1.3 Rounding 


A machine register acts much like the infamous Procrustes bed in Greek mythology. 
Procrustes was the innkeeper whose inn had only beds of one size. If a fellow came 
along who was too tall to fit into his beds, he cut off his feet. If the fellow was too 
short, he stretched him. In the same way, if a real number comes along that is too 
long, its tail end (not the head) is cut off; if it is too short, it is padded by zeros at 
the end. 

More specifically, let 

$$x \in \mathbb{R}, \qquad x = \pm\left(\sum_{k=1}^{\infty} b_{-k} 2^{-k}\right) 2^{e} \tag{1.7}$$

be the “exact” real number (in normalized floating-point form) and 

$$x^* \in \mathbb{R}(t,s), \qquad x^* = \pm\left(\sum_{k=1}^{t} b^*_{-k} 2^{-k}\right) 2^{e^*} \tag{1.8}$$


the rounded number. One then distinguishes between two methods of rounding, the 
first being Procrustes’ method. 


(a) Chopping. One takes 

$$x^* = \mathrm{chop}(x), \qquad e^* = e, \quad b^*_{-k} = b_{-k} \ \text{ for } k = 1, 2, \ldots, t. \tag{1.9}$$


(b) Symmetric rounding. This corresponds to the familiar rounding up or rounding 
down in decimal arithmetic, based on the first discarded decimal digit: if it is 
larger than or equal to 5, one rounds up; if it is less than 5, one rounds down. In 
binary arithmetic, the procedure is somewhat simpler, since there are only two 
possibilities: either the first discarded binary digit is 1, in which case one rounds 
up, or it is 0, in which case one rounds down. We can write the procedure very 
simply in terms of the chop operation in (1.9): 


$$x^* = \mathrm{rd}(x), \qquad \mathrm{rd}(x) := \mathrm{chop}\left(x + \tfrac{1}{2}\cdot 2^{-t}\cdot 2^{e}\right). \tag{1.10}$$
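
The two rounding rules (1.9) and (1.10) can be sketched as follows. This is a minimal model under stated assumptions: exact rationals stand in for real numbers, the exponent range is taken to be unlimited, and for negative x the half unit is added to the magnitude. The function names are illustrative.

```python
import math
from fractions import Fraction

def normalized_exponent(mag):
    """Return e such that 1/2 <= mag / 2**e < 1, for mag > 0."""
    e = 0
    while mag >= Fraction(2) ** e:
        e += 1
    while mag < Fraction(2) ** (e - 1):
        e -= 1
    return e

def chop(x, t):
    """Keep the first t binary digits of the normalized fraction of x, as in (1.9)."""
    x = Fraction(x)
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    mag = abs(x)
    e = normalized_exponent(mag)
    kept = math.floor(mag * Fraction(2) ** (t - e))  # integer made of the first t digits
    return sign * kept * Fraction(2) ** (e - t)

def rd(x, t):
    """Symmetric rounding as in (1.10): add half a unit in the last place, then chop."""
    x = Fraction(x)
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    e = normalized_exponent(abs(x))
    half_unit = Fraction(1, 2) * Fraction(2) ** (e - t)
    return chop(x + sign * half_unit, t)

x = Fraction(1, 5)
print(chop(x, 4), rd(x, 4))
```

With t = 4, the fraction of 1/5 is (.11001100...)_2 with exponent -2; chopping keeps (.1100)_2, while symmetric rounding yields (.1101)_2 because the first discarded digit is 1.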


There is a small error incurred in rounding, which is most easily estimated in the 
case of chopping. Here the absolute error |x — x*| is 


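$$|x - \mathrm{chop}(x)| = \left| \pm \sum_{k=t+1}^{\infty} b_{-k} 2^{-k} \right| 2^{e} \le \sum_{k=t+1}^{\infty} 2^{-k} \cdot 2^{e} = 2^{-t} \cdot 2^{e}.$$

It depends on $e$ (i.e., the magnitude of $x$), which is the reason why one prefers the 
relative error $|(x - x^*)/x|$ (if $x \neq 0$), which, for normalized $x$, can be estimated as 

$$\left| \frac{x - \mathrm{chop}(x)}{x} \right| \le \frac{2^{-t} \cdot 2^{e}}{\left| \pm \sum_{k=1}^{\infty} b_{-k} 2^{-k} \right| 2^{e}} \le \frac{2^{-t} \cdot 2^{e}}{\frac{1}{2} \cdot 2^{e}} = 2 \cdot 2^{-t}. \tag{1.11}$$

Similarly, in the case of symmetric rounding, one finds (cf. Ex. 6) 

$$\left| \frac{x - \mathrm{rd}(x)}{x} \right| \le 2^{-t}. \tag{1.12}$$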


The number on the right is an important, machine-dependent quantity, called the 
machine precision (or unit roundoff), 


$$\mathrm{eps} = 2^{-t}; \tag{1.13}$$

it determines the level of precision of any large-scale floating-point computation. 
In Matlab double-precision arithmetic, one has $t = 53$, so that $\mathrm{eps} \approx 1.11 \times 10^{-16}$ 
(cf. Ex. 5), corresponding to a precision of 15 to 16 significant decimal digits. 

Since it is awkward to work with inequalities, one prefers writing (1.12) 
equivalently as an equality, 

$$\mathrm{rd}(x) = x(1 + \varepsilon), \qquad |\varepsilon| \le \mathrm{eps}, \tag{1.14}$$

and defers dealing with the inequality (for $\varepsilon$) to the very end. 
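
Relation (1.14) can be checked for a concrete number. The sketch below assumes IEEE double precision with t = 53, as used by Python's `float`, and relies on `Fraction(0.1)` returning the exact value of the stored machine number.

```python
from fractions import Fraction

t = 53                          # fraction length assumed for double precision
eps = Fraction(1, 2**t)         # machine precision (1.13)
print(float(eps))               # about 1.11e-16

# Relative error committed when the decimal number 1/10 is stored as a double.
x = Fraction(1, 10)
stored = Fraction(0.1)          # exact value of rd(1/10)
epsilon = (stored - x) / x      # the epsilon of (1.14)
print(float(epsilon), abs(epsilon) <= eps)
```

The computed epsilon is nonzero, since 1/10 has an infinite binary expansion, but it stays within the bound eps.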


1.2. Machine Arithmetic 


The arithmetic used on computers unfortunately does not respect the laws of 
ordinary arithmetic. Each elementary floating-point operation, in general, generates 
a small error that may then propagate through subsequent machine operations. As 
a rule, this error propagation is harmless, except in the case of subtraction, where 
cancellation effects may seriously compromise the accuracy of the results. 


1.2.1 A Model of Machine Arithmetic 


Any of the four basic arithmetic operations, when applied to two machine numbers, 
may produce a result no longer representable on the computer. We have therefore 
errors also associated with arithmetic operations. Barring the occurrence of overflow 
or underflow, we may assume as a model of machine arithmetic that each arithmetic 
operation $\circ$ ($= +, -, \times, /$) produces a correctly rounded result. Thus, if 
$x, y \in \mathbb{R}(t, s)$ are floating-point machine numbers, and $\mathrm{fl}(x \circ y)$ denotes the 
machine-produced result of the arithmetic operation $x \circ y$, then 

$$\mathrm{fl}(x \circ y) = (x \circ y)(1 + \varepsilon), \qquad |\varepsilon| \le \mathrm{eps}. \tag{1.15}$$

This can be interpreted in a number of ways, for example, in the case of 
multiplication, 

$$\mathrm{fl}(x \times y) = [x(1+\varepsilon)] \times y = x \times [y(1+\varepsilon)] = \left(x\sqrt{1+\varepsilon}\right) \times \left(y\sqrt{1+\varepsilon}\right) = \cdots.$$

In each equation we identify the computed result as the exact result on data that are 
slightly perturbed, whereby the respective relative perturbations can be estimated, 
for example, by $|\varepsilon| \le \mathrm{eps}$ in the first two equations, and $\sqrt{1+\varepsilon} \approx 1 + \frac{1}{2}\varepsilon$, 
$|\frac{1}{2}\varepsilon| \le \frac{1}{2}\mathrm{eps}$ in the third. These are elementary examples of backward error analysis, a 
powerful tool for estimating errors in machine computation (cf. also Sect. 1.3). 
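
The model (1.15) can be tested on actual hardware arithmetic. The following sketch assumes IEEE double precision with correctly rounded basic operations (eps = 2^-53); the helper `rounding_error` is a hypothetical name, and exact rationals are used to recover the error of each machine operation.

```python
import operator
from fractions import Fraction

EPS = Fraction(1, 2**53)    # machine precision assumed for double precision

def rounding_error(op, x, y):
    """Relative error epsilon in fl(x op y) = (x op y)(1 + epsilon) for doubles x, y."""
    computed = Fraction(op(x, y))              # machine result, converted exactly
    exact = op(Fraction(x), Fraction(y))       # same operation in exact arithmetic
    return (computed - exact) / exact

x, y = 0.1, 0.3             # treated as exact machine numbers
operations = [("+", operator.add), ("-", operator.sub),
              ("*", operator.mul), ("/", operator.truediv)]
errors = {name: rounding_error(op, x, y) for name, op in operations}

for name, err in errors.items():
    print(name, float(err), abs(err) <= EPS)
```

Note that x and y here denote the stored doubles nearest to 0.1 and 0.3, in keeping with the assumption in (1.15) that the operands are machine numbers.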

Even though a single arithmetic operation causes a small error that can be 
neglected, a succession of arithmetic operations can well result in a significant error, 
owing to error propagation. It is like the small microorganisms that we all carry in 
our bodies: if our defense mechanism is in good order, the microorganisms cause no 
harm, in spite of their large presence. If for some reason our defenses are weakened, 
then all of a sudden they can play havoc with our health. The same is true in machine 
computation: the rounding errors, although widespread, will cause little harm unless 
our computations contain some weak spots that allow rounding errors to take over 
to the point of completely invalidating the results. We learn about one such weak 
spot (indeed the only one) in the next section.¹ 


1.2.2. Error Propagation in Arithmetic Operations: 
Cancellation Error 


We now study the extent to which the basic arithmetic operations propagate 
errors already present in their operands. Previously, in Sect. 1.2.1, we assumed the 
operands to be exact machine-representable numbers and discussed the errors due to 
imperfect execution of the arithmetic operations by the computer. We now change 
our viewpoint and assume that the operands themselves are contaminated by errors, 
but the arithmetic operations are carried out exactly. (We already know what to do, 
cf. (1.15), when we are dealing with machine operations.) Our interest is in the 
errors in the results caused by errors in the data. 


¹ Rounding errors can also have significant implications in real life. One example, taken from 
politics, concerns the problem of apportionment: how should the representatives in an assembly, 
such as the US House of Representatives or the Electoral College, be constituted to fairly reflect 
the size of population in the various states? If the total number of representatives in the assembly 
is given, say, $A$, the total population of the US is $P$, and the population of State $i$ is $p_i$, then State 
$i$ should be allocated 

$$r_i := \frac{p_i}{P}\, A$$

representatives. The problem is that $r_i$ is not an integer, in general. How then should $r_i$ be rounded 
to an integer $r_i^*$? One can think of three natural criteria to be imposed: (1) $r_i^*$ should be one 
of the two integers closest to $r_i$ (“quota condition”). (2) If $A$ is increased, all other things being 
the same, then $r_i^*$ should not decrease (“house monotonicity”). (3) If $p_i$ is increased, the other $p_j$ 
remaining constant, then $r_i^*$ should not decrease (“population monotonicity”). Unfortunately, there 
is no apportionment method that satisfies all three criteria. There is indeed a case in US history 
when Samuel J. Tilden lost his bid for the presidency in 1876 in favor of Rutherford B. Hayes, 
purely on the basis of the apportionment method adopted on that occasion (which, incidentally, 
was not the one prescribed by law at the time). 


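(a) Multiplication. We consider values $x(1 + \varepsilon_x)$ and $y(1 + \varepsilon_y)$ of $x$ and $y$ 
contaminated by relative errors $\varepsilon_x$ and $\varepsilon_y$, respectively. What is the relative error 
in the product? We assume $\varepsilon_x$, $\varepsilon_y$ sufficiently small so that quantities of second 
order, $\varepsilon_x^2$, $\varepsilon_x \varepsilon_y$, $\varepsilon_y^2$ (and even more so, quantities of still higher order) can be 
neglected against the epsilons themselves. Then 

$$x(1+\varepsilon_x) \cdot y(1+\varepsilon_y) = x \cdot y\,(1 + \varepsilon_x + \varepsilon_y + \varepsilon_x \varepsilon_y) \approx x \cdot y\,(1 + \varepsilon_x + \varepsilon_y).$$

Thus, the relative error $\varepsilon_{x \cdot y}$ in the product is given (at least approximately) by 

$$\varepsilon_{x \cdot y} = \varepsilon_x + \varepsilon_y, \tag{1.16}$$

that is, the (relative) errors in the data are being added to produce the (relative) 
error in the result. We consider this to be acceptable error propagation, and in 
this sense, multiplication is a benign operation. 

(b) Division. Here we have similarly (if $y \neq 0$) 

$$\frac{x(1+\varepsilon_x)}{y(1+\varepsilon_y)} = \frac{x}{y}\,(1+\varepsilon_x)(1 - \varepsilon_y + \varepsilon_y^2 - \cdots) \approx \frac{x}{y}\,(1 + \varepsilon_x - \varepsilon_y),$$

that is, 

$$\varepsilon_{x/y} = \varepsilon_x - \varepsilon_y. \tag{1.17}$$

Also division is a benign operation. 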


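(c) Addition and subtraction. Since $x$ and $y$ can be numbers of arbitrary signs, it 
suffices to look at addition. We have 

$$x(1+\varepsilon_x) + y(1+\varepsilon_y) = x + y + x\varepsilon_x + y\varepsilon_y = (x+y)\left(1 + \frac{x\varepsilon_x + y\varepsilon_y}{x+y}\right),$$

assuming $x + y \neq 0$. Therefore, 

$$\varepsilon_{x+y} = \frac{x}{x+y}\,\varepsilon_x + \frac{y}{x+y}\,\varepsilon_y. \tag{1.18}$$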
    x     = [ 1 0 1 1 0 0 1 0 1 b   b   g g g g ]  e
    y     = [ 1 0 1 1 0 0 1 0 1 b'  b'  g g g g ]  e
    x - y = [ 0 0 0 0 0 0 0 0 0 b'' b'' g g g g ]  e
          = [ b'' b'' g g g g ? ? ? ? ? ? ? ? ? ]  e - 9

Fig. 1.3 The cancellation phenomenon 


As before, the error in the result is a linear combination of the errors in the 
data, but now the coefficients are no longer $\pm 1$ but can assume values that are 
arbitrarily large. Note first, however, that when $x$ and $y$ have the same sign, then 
both coefficients are positive and bounded by 1, so that 

$$|\varepsilon_{x+y}| \le |\varepsilon_x| + |\varepsilon_y| \qquad (x \cdot y > 0); \tag{1.19}$$


addition, in this case, is again a benign operation. It is only when x and y have 
opposite signs that the coefficients in (1.18) can be arbitrarily large, namely, when 
|x + y| is arbitrarily small compared to |x| and | y|. This happens when x and y are 
almost equal in absolute value, but opposite in sign. The large magnification of error 
then occurring in (1.18) is referred to as cancellation error. It is the only serious 
weakness — the Achilles heel, as it were — of numerical computation, and it should 
be avoided whenever possible. In particular, one should be prepared to encounter 
cancellation effects not only in single devastating amounts, but also repeatedly over 
a long period of time involving “small doses” of cancellation. Either way, the end 
result can be disastrous. 
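
The size of the coefficients in (1.18) is easily examined numerically. The sketch below uses exact rationals so that only the data errors, not the arithmetic, affect the result; the function names and sample values are illustrative assumptions.

```python
from fractions import Fraction

def sum_error_coefficients(x, y):
    """Coefficients of eps_x and eps_y in (1.18)."""
    s = x + y
    return x / s, y / s

def relative_error_of_sum(x, y, eps_x, eps_y):
    """Relative error of x(1 + eps_x) + y(1 + eps_y) with respect to x + y."""
    perturbed = x * (1 + eps_x) + y * (1 + eps_y)
    return (perturbed - (x + y)) / (x + y)

eps_x = Fraction(1, 10**10)     # relative error in x; y is taken to be exact

# Same signs: coefficients lie between 0 and 1 (benign).
print(sum_error_coefficients(Fraction(3), Fraction(1)))

# Nearly equal magnitudes, opposite signs: coefficients of order 10^6 (cancellation).
x, y = Fraction(1000001, 1000000), Fraction(-1)
cx, cy = sum_error_coefficients(x, y)
print(cx, cy, float(relative_error_of_sum(x, y, eps_x, 0)))
```

In the second case a relative error of 10^-10 in x is magnified to about 10^-4 in the sum, that is, by the factor x/(x + y).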

We illustrate the cancellation phenomenon schematically in Fig. 1.3, where b, 
b', b'' stand for binary digits that are reliable, and the g represent binary digits 
contaminated by error; these are often called “garbage” digits. Note in Fig. 1.3 that 
“garbage − garbage = garbage,” but, more importantly, that the final normalization 
of the result moves the first garbage digit from the 12th position to the 3rd. 

Cancellation is such a serious matter that we wish to give a number of elementary 
examples, not only of its occurrence, but also of how it might be avoided. 


Examples. 1. An algebraic identity: $(a - b)^2 = a^2 - 2ab + b^2$. Although this is 
a valid identity in algebra, it is no longer valid in machine arithmetic. Thus, on 
a 2-decimal-digit computer, with $a = 1.8$, $b = 1.7$, we get, using symmetric 
rounding, 

$$\mathrm{fl}(a^2 - 2ab + b^2) = 3.2 - 6.2 + 2.9 = -0.10$$

instead of the true result 0.010, which we obtain also on our 2-digit computer 
if we use the left-hand side of the identity. The expanded form of the square 
thus produces a result which is off by one order of magnitude and on top has 
the wrong sign. 
2. Quadratic equation: $x^2 - 56x + 1 = 0$. The usual formula for a quadratic gives, 
in 5-decimal arithmetic, 

$$x_1 = 28 - \sqrt{783} = 28 - 27.982 = 0.018000,$$

$$x_2 = 28 + \sqrt{783} = 28 + 27.982 = 55.982.$$

This should be contrasted with the exact roots 0.0178628... and 55.982137... . 
As can be seen, the smaller of the two is obtained to only two correct decimal 
digits, owing to cancellation. An easy way out, of course, is to compute $x_2$ first, 
which involves a benign addition, and then to compute $x_1 = 1/x_2$ by Vieta’s 
formula, which again involves a benign operation, namely division. In this way we 
obtain both roots to full machine accuracy. 

3. Compute $y = \sqrt{x + \delta} - \sqrt{x}$, where $x > 0$ and $|\delta|$ is very small. Clearly, the 
formula as written causes severe cancellation errors, since each square root has 
to be rounded. Writing instead 

$$y = \frac{\delta}{\sqrt{x + \delta} + \sqrt{x}}$$

completely removes the problem. 

4. Compute $y = \cos(x + \delta) - \cos x$, where $|\delta|$ is very small. Here cancellation 
can be avoided by writing $y$ in the equivalent form 

$$y = -2 \sin\frac{\delta}{2}\, \sin\left(x + \frac{\delta}{2}\right).$$

5. Compute $y = f(x + \delta) - f(x)$, where $|\delta|$ is very small and $f$ a given function. 
Special tricks, such as those used in the two preceding examples, can no longer 
be played, but if $f$ is sufficiently smooth in the neighborhood of $x$, we can use 
Taylor expansion: 

$$y = f'(x)\,\delta + \frac{1}{2} f''(x)\,\delta^2 + \cdots.$$

The terms in this series decrease rapidly when $|\delta|$ is small so that cancellation 
is no longer a problem. 
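
Examples 1 to 3 can be reproduced on an ordinary computer. The sketch below simulates the 2-digit and 5-digit decimal machines with Python's `decimal` module by setting the working precision of a local context; this is an assumption about how to mimic such machines, and the module's default rounding (round half to even) happens to agree with symmetric rounding for these particular numbers. Example 3 is run in ordinary double precision with illustrative values of x and δ.

```python
import math
from decimal import Decimal, localcontext

# Example 1: two significant decimal digits, a = 1.8, b = 1.7.
with localcontext() as ctx:
    ctx.prec = 2
    a, b = Decimal("1.8"), Decimal("1.7")
    expanded = a * a - 2 * (a * b) + b * b     # 3.2 - 6.2 + 2.9
    factored = (a - b) * (a - b)               # 0.1 * 0.1

# Example 2: five significant decimal digits, x^2 - 56x + 1 = 0.
with localcontext() as ctx:
    ctx.prec = 5
    root = Decimal(783).sqrt()                 # square root rounded to 5 digits
    x1_naive = 28 - root                       # subtraction with cancellation
    x2 = 28 + root                             # benign addition
    x1_vieta = 1 / x2                          # benign division

# Example 3 in double precision.
x, delta = 1.0, 1e-8
y_naive = math.sqrt(x + delta) - math.sqrt(x)
y_stable = delta / (math.sqrt(x + delta) + math.sqrt(x))

print(expanded, factored)
print(x1_naive, x1_vieta, x2)
print(y_naive, y_stable)
```

The product 2ab is formed as 2 × fl(ab), which is the evaluation order that produces the value 6.2 quoted in Example 1.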


Addition is an example of a potentially ill-conditioned function (of two vari- 
ables). It naturally leads us to study the condition of more general functions. 


1.3. The Condition of a Problem 


A problem typically has an input and an output. The input consists of a set of 
data, say, the coefficients of some equation, and the output of another set of 
numbers uniquely determined by the input, say, all the roots of the equation in some 
prescribed order. If we collect the input in a vector $x \in \mathbb{R}^m$ (assuming the data 
consist of real numbers), and the output in the vector $y \in \mathbb{R}^n$ (also assumed real), 
we have the black box situation shown in Fig. 1.4, where the box $P$ accepts some 
input $x$ and then solves the problem for this input to produce the output $y$. 

Fig. 1.4 Black box representation of a problem ($x \to P \to y$) 

We may thus think of a problem as a map f, given by 


$$f:\ \mathbb{R}^m \to \mathbb{R}^n, \qquad y = f(x). \tag{1.20}$$


(One or both of the spaces $\mathbb{R}^m$, $\mathbb{R}^n$ could be complex spaces without changing in any 
essential way the discussion that follows.) What we are interested in is the sensitivity 
of the map f at some given point x to a small perturbation of x, that is, how much 
bigger (or smaller) the perturbation in y is compared to the perturbation in x. In 
particular, we wish to measure the degree of sensitivity by a single number — the 
condition number of the map f at the point x. We emphasize that, as we perturb 
x, the function f is always assumed to be evaluated exactly, with infinite precision. 
The condition of f, therefore, is an inherent property of the map f and does not 
depend on any algorithmic considerations concerning its implementation. 

This is not to say that knowledge of the condition of a problem is irrelevant 
to any algorithmic solution of the problem. On the contrary: quite often the 
computed solution $y^*$ of (1.20) (computed in floating-point machine 
arithmetic, using a specific algorithm) can be demonstrated to be the exact solution 
to a “nearby” problem, that is, 

$$y^* = f(x^*), \tag{1.21}$$

where $x^*$ is a vector close to the given data $x$, 

$$x^* = x + \delta, \tag{1.22}$$

and moreover, the distance $\|\delta\|$ of $x^*$ to $x$ can be estimated in terms of the machine 
precision. Therefore, if we know how strongly (or weakly) the map $f$ reacts to a 
small perturbation, such as $\delta$ in (1.22), we can say something about the error $y^* - y$ 
in the solution caused by this perturbation. This, indeed, is an important technique 
of error analysis, known as backward error analysis, which was pioneered in the 
1950s by J. W. Givens, C. Lanczos, and, above all, J. H. Wilkinson. 

Maps f between more general spaces (in particular, function spaces) have also 
been considered from the point of view of conditioning, but eventually, these spaces 
have to be reduced to finite-dimensional spaces for practical implementation of the 
maps in question. 
We start with the simplest case of a single function of one variable. 

The case $m = n = 1$: $y = f(x)$. Assuming first $x \neq 0$, $y \neq 0$, and denoting 
by $\Delta x$ a small perturbation of $x$, we have for the corresponding perturbation $\Delta y$ by 
Taylor’s formula 

$$\Delta y = f(x + \Delta x) - f(x) \approx f'(x)\,\Delta x, \tag{1.23}$$

assuming that $f$ is differentiable at $x$. Since our interest is in relative errors, we 
write this in the form 

$$\frac{\Delta y}{y} \approx \frac{x f'(x)}{f(x)} \cdot \frac{\Delta x}{x}. \tag{1.24}$$

The approximate equality becomes a true equality in the limit as $\Delta x \to 0$. This 
suggests that the condition of $f$ at $x$ be defined by the quantity 

$$(\mathrm{cond}\, f)(x) := \left| \frac{x f'(x)}{f(x)} \right|. \tag{1.25}$$

This number tells us how much larger the relative perturbation in $y$ is compared to 
the relative perturbation in $x$. 

If $x = 0$ and $y \neq 0$, it is more meaningful to consider the absolute error 
measure for $x$ and for $y$ still the relative error. This leads to the condition number 
$|f'(x)/f(x)|$. Similarly for $y = 0$, $x \neq 0$. If $x = y = 0$, the condition number by 
(1.23) would then simply be $|f'(x)|$. 

The case of arbitrary $m$, $n$: here we write 

$$x = [x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^m, \qquad y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^n$$


and exhibit the map $f$ in component form 

$$y_\nu = f_\nu(x_1, x_2, \ldots, x_m), \qquad \nu = 1, 2, \ldots, n. \tag{1.26}$$

We assume again that each function $f_\nu$ has partial derivatives with respect to all $m$ 
variables at the point $x$. Then the most detailed analysis departs from considering 
each component $y_\nu$ as a function of one single variable, $x_\mu$. In other words, we 
subject only one variable, $x_\mu$, to a small change and observe the resulting change in 
just one component, $y_\nu$. Then we can apply (1.25) and obtain 

$$\gamma_{\nu\mu}(x) := (\mathrm{cond}_{\nu\mu} f)(x) = \left| \frac{x_\mu \dfrac{\partial f_\nu}{\partial x_\mu}}{f_\nu(x)} \right|. \tag{1.27}$$
This gives us a whole matrix $\Gamma(x) = [\gamma_{\nu\mu}(x)] \in \mathbb{R}_+^{n \times m}$ of condition numbers. 
To obtain a single condition number, we can take any convenient measure of the 
“magnitude” of the matrix $\Gamma(x)$ such as one of the matrix norms defined in (1.30), 

$$(\mathrm{cond}\, f)(x) = \|\Gamma(x)\|, \qquad \Gamma(x) = [\gamma_{\nu\mu}(x)]. \tag{1.28}$$


The condition so defined, of course, depends on the choice of norm, but the order 
of magnitude (and that is all that counts) should be more or less the same for any 
reasonable norm. 
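
Definitions (1.25), (1.27), and (1.28) translate directly into code. The sketch below is only an illustration under stated assumptions: derivatives are supplied by the caller, the map `f` with components x1 + x2 and x1·x2 is a hypothetical example, and the maximum row sum is used as one possible choice of matrix norm.

```python
import math

def cond_scalar(f, df, x):
    """Condition number (1.25) of a scalar function f with derivative df at x."""
    return abs(x * df(x) / f(x))

def condition_matrix(f, jacobian, x):
    """Matrix Gamma(x) of (1.27): entry [nu][mu] is |x_mu * (df_nu/dx_mu) / f_nu(x)|."""
    y = f(x)
    J = jacobian(x)
    return [[abs(x[mu] * J[nu][mu] / y[nu]) for mu in range(len(x))]
            for nu in range(len(y))]

def max_row_sum(A):
    """Maximum absolute row sum, used here as the norm in (1.28)."""
    return max(sum(abs(a) for a in row) for row in A)

# Scalar cases: the square root is well conditioned everywhere,
# whereas x - 1 is ill conditioned near x = 1.
print(cond_scalar(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0))
print(cond_scalar(lambda x: x - 1.0, lambda x: 1.0, 1.001))

# Hypothetical map f(x) = [x1 + x2, x1 * x2] with its Jacobian.
def f(x):
    return [x[0] + x[1], x[0] * x[1]]

def jacobian(x):
    return [[1.0, 1.0], [x[1], x[0]]]

gamma = condition_matrix(f, jacobian, [3.0, 1.0])
print(gamma, max_row_sum(gamma))
```

For the product component both entries of Γ equal 1 regardless of x, reflecting that multiplication is benign; for the sum component the entries are the coefficients already met in (1.18).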
If a component of x, or of y, vanishes, one modifies (1.27) as discussed earlier. 
A less-refined analysis can be modeled after the one-dimensional case by 
defining the relative perturbation of x € IR” to mean 


|| Ax ||am 


, Ax = [Ax,,Axo,..., Axm]", (1.29) 
|| [lem 


where Ax is a perturbation vector whose components Ax,, are small compared to 
X,, and where || - ||1z~ is some vector norm in IR”. For the perturbation A y caused by 
Ax, one defines similarly the relative perturbation || Ay ||” /|| y ||», with a suitable 
vector norm || - ||m in R”. One then tries to relate the relative perturbation in y to 
the one in x. 

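To carry this out, one needs to define a matrix norm for matrices $A \in \mathbb{R}^{n \times m}$. We
choose the so-called “operator norm,”

$$\|A\|_{\mathbb{R}^{n \times m}} := \max_{x \in \mathbb{R}^m,\ x \ne 0} \frac{\|Ax\|_{\mathbb{R}^n}}{\|x\|_{\mathbb{R}^m}}. \tag{1.30}$$

In the following we take for the vector norms the “uniform” (or infinity) norm,
$$\|x\|_{\mathbb{R}^m} = \max_{1 \le \mu \le m} |x_\mu| =: \|x\|_\infty, \quad \|y\|_{\mathbb{R}^n} = \max_{1 \le \nu \le n} |y_\nu| =: \|y\|_\infty. \tag{1.31}$$

It is then easy to show that (cf. Ex. 32)
$$\|A\|_{\mathbb{R}^{n \times m}} = \|A\|_\infty := \max_{1 \le \nu \le n} \sum_{\mu=1}^m |a_{\nu\mu}|, \quad A = [a_{\nu\mu}] \in \mathbb{R}^{n \times m}. \tag{1.32}$$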
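The identity (1.32) can be checked numerically on a small matrix. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, and the brute-force search relies on the fact that, for the infinity norm, the maximum in (1.30) is attained at a vector whose components are all $\pm 1$.

```python
from itertools import product

def inf_norm_matrix(A):
    """Right-hand side of (1.32): the largest absolute row sum of A."""
    return max(sum(abs(a) for a in row) for row in A)

def inf_norm_by_search(A):
    """Left-hand side of (1.32), found by trying every sign vector x.

    Each such x has infinity norm 1, so the quotient in (1.30) is just
    the infinity norm of A*x.
    """
    m = len(A[0])
    best = 0.0
    for x in product((-1.0, 1.0), repeat=m):
        Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        best = max(best, max(abs(v) for v in Ax))
    return best

A = [[1.0, -2.0, 3.0],
     [4.0, 5.0, -6.0]]
print(inf_norm_matrix(A), inf_norm_by_search(A))
```

The search visits $2^m$ vectors and is meant only as a check for small $m$; the row-sum formula is what one would use in practice.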


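Now in analogy to (1.23), we have

$$\Delta y_\nu = f_\nu(x + \Delta x) - f_\nu(x) \approx \sum_{\mu=1}^m \frac{\partial f_\nu}{\partial x_\mu}\, \Delta x_\mu,$$

with the partial derivatives evaluated at $x$. Therefore, at least approximately,

$$|\Delta y_\nu| \le \sum_{\mu=1}^m \left| \frac{\partial f_\nu}{\partial x_\mu} \right| |\Delta x_\mu| \le \max_\mu |\Delta x_\mu| \cdot \sum_{\mu=1}^m \left| \frac{\partial f_\nu}{\partial x_\mu} \right| \le \max_\mu |\Delta x_\mu| \cdot \max_\nu \sum_{\mu=1}^m \left| \frac{\partial f_\nu}{\partial x_\mu} \right|.$$

Since this holds for each $\nu = 1, 2, \ldots, n$, it also holds for $\max_\nu |\Delta y_\nu|$, giving, in
view of (1.31) and (1.32),
$$\|\Delta y\|_\infty \le \|\Delta x\|_\infty \left\| \frac{\partial f}{\partial x} \right\|_\infty. \tag{1.33}$$
Here,
$$\frac{\partial f}{\partial x} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial x_1} & \dfrac{\partial f_n}{\partial x_2} & \cdots & \dfrac{\partial f_n}{\partial x_m} \end{bmatrix} \in \mathbb{R}^{n \times m} \tag{1.34}$$

is the Jacobian matrix of $f$. (This is the analogue of the first derivative for systems
of functions of several variables.) From (1.33) one now immediately obtains for the
relative perturbations

$$\frac{\|\Delta y\|_\infty}{\|y\|_\infty} \le \frac{\|x\|_\infty \, \|\partial f / \partial x\|_\infty}{\|f(x)\|_\infty} \cdot \frac{\|\Delta x\|_\infty}{\|x\|_\infty}.$$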


Although this is an inequality, it is sharp in the sense that equality can be achieved 
for a suitable perturbation Ax. We are justified, therefore, in defining a global 
condition number by 


$$(\operatorname{cond} f)(x) := \frac{\|x\|_\infty \, \|\partial f / \partial x\|_\infty}{\|f(x)\|_\infty}. \tag{1.35}$$

Clearly, in the case $m = n = 1$, definition (1.35) reduces precisely to definition
(1.25) (as well as (1.28)) given earlier. In higher dimensions ($m$ and/or $n$ larger than
1), however, the condition number in (1.35) is much cruder than the one in (1.28).
This is because norms tend to destroy detail: if $x$, for example, has components
of vastly different magnitudes, then $\|x\|_\infty$ is simply equal to the largest of these
components, and all the others are ignored. For this reason, some caution is required
when using (1.35).

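To give an example, consider

$$f(x) = \begin{bmatrix} \dfrac{1}{x_1} + \dfrac{1}{x_2} \\[2mm] \dfrac{1}{x_1} - \dfrac{1}{x_2} \end{bmatrix}.$$

The components of the condition matrix $\Gamma(x)$ in (1.27) are then

$$\gamma_{11} = \left| \frac{x_2}{x_1 + x_2} \right|, \quad \gamma_{12} = \left| \frac{x_1}{x_1 + x_2} \right|, \quad \gamma_{21} = \left| \frac{x_2}{x_2 - x_1} \right|, \quad \gamma_{22} = \left| \frac{x_1}{x_2 - x_1} \right|,$$

indicating ill-conditioning if either $x_1 \approx x_2$ or $x_1 \approx -x_2$ and $|x_1|$ (hence also $|x_2|$)
is not small. The global condition number (1.35), on the other hand, since

$$\frac{\partial f}{\partial x}(x) = -\frac{1}{x_1^2 x_2^2} \begin{bmatrix} x_2^2 & x_1^2 \\ x_2^2 & -x_1^2 \end{bmatrix},$$

becomes, when $L_1$ vector and matrix norms are used (cf. Ex. 33),

$$(\operatorname{cond} f)(x) = \frac{\|x\|_1 \cdot \dfrac{2}{x_1^2 x_2^2} \max(x_1^2, x_2^2)}{\dfrac{1}{|x_1 x_2|} \left( |x_1 + x_2| + |x_1 - x_2| \right)} = 2\, \frac{|x_1| + |x_2|}{|x_1 x_2|} \cdot \frac{\max(x_1^2, x_2^2)}{|x_1 + x_2| + |x_1 - x_2|}.$$

Here $x_1 \approx x_2$ or $x_1 \approx -x_2$ yields $(\operatorname{cond} f)(x) \approx 2$, which is obviously misleading.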
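The contrast between the detailed and the global condition numbers can be made concrete with a few lines of Python. This is a sketch under an assumption: the map is taken to be $f(x) = [1/x_1 + 1/x_2,\ 1/x_1 - 1/x_2]^T$, which is the map the condition-matrix entries above correspond to. The function names are hypothetical.

```python
def cond_matrix(x1, x2):
    """Entries gamma_{nu mu} of (1.27) for f = [1/x1 + 1/x2, 1/x1 - 1/x2]."""
    return [[abs(x2 / (x1 + x2)), abs(x1 / (x1 + x2))],
            [abs(x2 / (x2 - x1)), abs(x1 / (x2 - x1))]]

def cond_global_l1(x1, x2):
    """Global condition number (1.35) with L1 vector and matrix norms."""
    norm_x = abs(x1) + abs(x2)
    # L1 matrix norm = largest absolute column sum of the Jacobian
    norm_jac = 2.0 * max(x1 * x1, x2 * x2) / (x1 * x2) ** 2
    norm_f = abs(1 / x1 + 1 / x2) + abs(1 / x1 - 1 / x2)
    return norm_x * norm_jac / norm_f

x1, x2 = 1.0, 1.0 + 1e-6          # nearly equal components
gamma = cond_matrix(x1, x2)
print(max(max(row) for row in gamma))   # large: second component is ill-conditioned
print(cond_global_l1(x1, x2))           # stays near 2
```

The largest entry of the condition matrix is of the order $10^6$ here, while the norm-based number remains close to 2, which is the loss of detail the text warns about.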


1.3.2. Examples 


We illustrate the idea of numerical condition in a number of examples, some of 
which are of considerable interest in applications. 


1. Compute $I_n = \int_0^1 \frac{t^n}{t+5}\, dt$ for some fixed integer $n \ge 1$. As it stands, the
example here deals with a map from the integers to reals and therefore does
not fit our concept of “problem” in (1.20). However, we propose to compute $I_n$
recursively by relating $I_k$ to $I_{k-1}$ and noting that

$$I_0 = \int_0^1 \frac{dt}{t+5} = \ln(t+5) \Big|_0^1 = \ln \frac{6}{5}. \tag{1.36}$$

[Fig. 1.5 Black box for recursion (1.38): $y_0 \to f_n \to y_n$]


To find the recursion, observe that
$$\frac{t}{t+5} = 1 - \frac{5}{t+5}.$$

Thus, multiplying both sides by $t^{k-1}$ and integrating from 0 to 1 yields
$$I_k = -5 I_{k-1} + \frac{1}{k}, \quad k = 1, 2, \ldots, n. \tag{1.37}$$

We see that $I_k$ is a solution of the (linear, inhomogeneous, first-order) difference
equation

$$y_k = -5 y_{k-1} + \frac{1}{k}, \quad k = 1, 2, 3, \ldots. \tag{1.38}$$

We now have what appears to be a practical scheme to compute $I_n$: start with
$y_0 = I_0$ given by (1.36) and then apply in succession (1.38) for $k = 1, 2, \ldots, n$;
then $y_n = I_n$. Recursion (1.38), for any starting value $y_0$, defines a function,

$$y_n = f_n(y_0). \tag{1.39}$$


We have the black box in Fig. 1.5 and thus a problem $f_n : \mathbb{R} \to \mathbb{R}$. (Here $n$
is a parameter.) We are interested in the condition of $f_n$ at the point $y_0 = I_0$
given by (1.36). Indeed, $I_0$ in (1.36) is not machine representable and must be
rounded to $I_0^*$ before recursion (1.38) can be employed. Even if no further errors
are introduced during the recursion, the final result will not be exactly $I_n$, but
some approximation $I_n^* = f_n(I_0^*)$, and we have, at least approximately (actually
exactly; see the remark after (1.46)),

$$\left| \frac{I_n^* - I_n}{I_n} \right| = (\operatorname{cond} f_n)(I_0) \left| \frac{I_0^* - I_0}{I_0} \right|. \tag{1.40}$$
To compute the condition number, note that $f_n$ is a linear function of $y_0$.
Indeed, if $n = 1$, then

$$y_1 = f_1(y_0) = -5 y_0 + 1.$$
If $n = 2$, then
$$y_2 = f_2(y_0) = -5 y_1 + \frac{1}{2} = (-5)^2 y_0 - 5 + \frac{1}{2},$$
and so on. In general,

$$y_n = f_n(y_0) = (-5)^n y_0 + p_n,$$

where $p_n$ is some number (independent of $y_0$). There follows

$$(\operatorname{cond} f_n)(y_0) = \left| \frac{y_0 f_n'(y_0)}{y_n} \right| = \left| \frac{y_0 (-5)^n}{y_n} \right|. \tag{1.41}$$
Now, if $y_0 = I_0$, then $y_n = I_n$, and from the definition of $I_n$ as an integral it is
clear that $I_n$ decreases monotonically in $n$ (and indeed converges monotonically
to zero as $n \to \infty$). Therefore,

$$(\operatorname{cond} f_n)(I_0) = \frac{I_0 \cdot 5^n}{I_n} > \frac{I_0 \cdot 5^n}{I_0} = 5^n. \tag{1.42}$$

We see that $f_n(y_0)$ is severely ill-conditioned at $y_0 = I_0$, the more so the
larger $n$.

We could have anticipated this result by just looking at the recursion (1.38): 
we keep multiplying by (—5), which tends to make things bigger, whereas 
they should get smaller. Thus, there will be continuous cancellation occurring 
throughout the recursion. 
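The error growth predicted by (1.42) is easy to observe. The following Python sketch runs the forward recursion (1.38) in double precision, starting from the rounded value of $I_0 = \ln(6/5)$. The function name is hypothetical, and the two-sided bound $1/(6(n+1)) < I_n < 1/(5(n+1))$ printed for comparison is an added check that follows from $5 \le t + 5 \le 6$ on $[0,1]$.

```python
import math

def forward(y0, n):
    """Apply y_k = -5*y_{k-1} + 1/k for k = 1, ..., n, cf. (1.38)."""
    y = y0
    for k in range(1, n + 1):
        y = -5.0 * y + 1.0 / k
    return y

I0 = math.log(6.0 / 5.0)       # rounded starting value I_0^*
for n in (5, 10, 20, 30):
    lower, upper = 1.0 / (6 * (n + 1)), 1.0 / (5 * (n + 1))
    print(n, forward(I0, n), (lower, upper))
```

For small $n$ the computed value lies inside the bounds; for $n = 30$ the initial rounding error has been multiplied by roughly $5^{30}$ and the result bears no relation to $I_{30}$.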

How can we avoid this ill-conditioning? The clue comes from the remark
just made: instead of multiplying by a large number, we would prefer dividing
by a large number, especially if the results get bigger at the same time. This is
accomplished by reversing recurrence (1.38), that is, by choosing a $\nu > n$ and
computing

$$y_{k-1} = \frac{1}{5} \left( \frac{1}{k} - y_k \right), \quad k = \nu, \nu - 1, \ldots, n + 1. \tag{1.43}$$

The problem then, of course, is how to compute the starting value $y_\nu$. Before we
deal with this, let us observe that we now have a new black box, as shown in
Fig. 1.6.

[Fig. 1.6 Black box for the recursion (1.43): $y_\nu \to g_n \to y_n$]

As before, the function involved, $g_n$, is a linear function of $y_\nu$, and an
argument similar to the one leading to (1.41) then gives

$$(\operatorname{cond} g_n)(y_\nu) = \left| \frac{y_\nu \left( -\frac{1}{5} \right)^{\nu - n}}{y_n} \right|, \quad \nu > n. \tag{1.44}$$
For $y_\nu = I_\nu$, we get, again by the monotonicity of $I_n$,
$$(\operatorname{cond} g_n)(I_\nu) < \left( \frac{1}{5} \right)^{\nu - n}, \quad \nu > n. \tag{1.45}$$
In analogy to (1.40), we now have
$$\left| \frac{I_n^* - I_n}{I_n} \right| = (\operatorname{cond} g_n)(I_\nu) \left| \frac{I_\nu^* - I_\nu}{I_\nu} \right| < \left( \frac{1}{5} \right)^{\nu - n} \left| \frac{I_\nu^* - I_\nu}{I_\nu} \right|, \tag{1.46}$$


where $I_\nu^*$ is some approximation of $I_\nu$. Actually, $I_\nu^*$ does not even have to be
close to $I_\nu$ for (1.46) to hold, since the function $g_n$ is linear. Thus, we may take
$I_\nu^* = 0$, committing a 100% error in the starting value, yet obtaining $I_n^*$ with a
relative error

$$\left| \frac{I_n^* - I_n}{I_n} \right| < \left( \frac{1}{5} \right)^{\nu - n}, \quad \nu > n. \tag{1.47}$$

The bound on the right can be made arbitrarily small, say, $\le \varepsilon$, if we choose $\nu$
large enough, for example,

$$\nu \ge n + \frac{\ln(1/\varepsilon)}{\ln 5}. \tag{1.48}$$

The final procedure, therefore, is: given the desired relative accuracy $\varepsilon$, choose $\nu$
to be the smallest integer satisfying (1.48) and then compute

$$I_\nu^* = 0, \qquad I_{k-1}^* = \frac{1}{5} \left( \frac{1}{k} - I_k^* \right), \quad k = \nu, \nu - 1, \ldots, n + 1. \tag{1.49}$$

This will produce a sufficiently accurate $I_n^* \approx I_n$, even in the presence of
rounding errors committed in (1.49): they, too, will be consistently attenuated.
Similar ideas can be applied to the more important problem of computing
solutions to second-order linear recurrence relations such as those satisfied by
Bessel functions and many other special functions of mathematical physics.
The procedure of backward recurrence is then closely tied up with the theory
of continued fractions.
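The backward procedure can be sketched in a few lines of Python. The function name and its interface are hypothetical; the starting index is chosen as the smallest integer $\nu \ge n + \ln(1/\varepsilon)/\ln 5$, which makes the bound $(1/5)^{\nu - n}$ in (1.47) at most $\varepsilon$, and the starting value is simply zero.

```python
import math

def backward(n, eps):
    """Approximate I_n = int_0^1 t^n/(t+5) dt by backward recurrence.

    Starts from y_nu = 0 and applies y_{k-1} = (1/k - y_k)/5
    for k = nu, nu-1, ..., n+1, cf. (1.43) and (1.49).
    """
    nu = n + math.ceil(math.log(1.0 / eps) / math.log(5.0))
    y = 0.0                              # deliberately crude starting value
    for k in range(nu, n, -1):
        y = (1.0 / k - y) / 5.0
    return y

print(backward(0, 1e-12), math.log(6.0 / 5.0))   # the two values should agree closely
print(backward(30, 1e-10))
```

Every step divides the inherited error by 5, so both the 100% starting error and the rounding errors made along the way are damped rather than amplified.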

2. Algebraic equations: these are equations involving a polynomial of given
degree $n$,

$$p(x) = 0, \quad p(x) = x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0, \quad a_0 \ne 0. \tag{1.50}$$
Let $\xi$ be some fixed root of the equation, which we assume to be simple,
$$p(\xi) = 0, \quad p'(\xi) \ne 0. \tag{1.51}$$
The problem then is to find $\xi$, given $p$. The data vector $a = [a_0, a_1, \ldots, a_{n-1}]^T$
$\in \mathbb{R}^n$ consists of the coefficients of the polynomial $p$, and the result is $\xi$, a real
or complex number. Thus, we have

$$\xi : \mathbb{R}^n \to \mathbb{C}, \quad \xi = \xi(a_0, a_1, \ldots, a_{n-1}). \tag{1.52}$$


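What is the condition of $\xi$? We adopt the detailed approach of (1.27) and first
define

$$\gamma_\nu = (\operatorname{cond}_\nu \xi)(a) = \left| \frac{a_\nu \, \dfrac{\partial \xi}{\partial a_\nu}}{\xi} \right|, \quad \nu = 0, 1, \ldots, n - 1. \tag{1.53}$$
Then we take a convenient norm, say, the $L_1$ norm $\|\gamma\|_1 := \sum_{\nu=0}^{n-1} |\gamma_\nu|$ of the
vector $\gamma = [\gamma_0, \ldots, \gamma_{n-1}]^T$, to define
$$(\operatorname{cond} \xi)(a) = \sum_{\nu=0}^{n-1} (\operatorname{cond}_\nu \xi)(a). \tag{1.54}$$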


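To determine the partial derivative of $\xi$ with respect to $a_\nu$, observe that we have
the identity

$$[\xi(a_0, a_1, \ldots, a_{n-1})]^n + a_{n-1} [\xi(\cdots)]^{n-1} + \cdots + a_\nu [\xi(\cdots)]^\nu + \cdots + a_0 \equiv 0.$$

Differentiating this with respect to $a_\nu$, we get

$$n [\xi(a_0, a_1, \ldots, a_{n-1})]^{n-1} \frac{\partial \xi}{\partial a_\nu} + a_{n-1} (n-1) [\xi(\cdots)]^{n-2} \frac{\partial \xi}{\partial a_\nu} + \cdots + a_\nu \nu [\xi(\cdots)]^{\nu - 1} \frac{\partial \xi}{\partial a_\nu} + \cdots + a_1 \frac{\partial \xi}{\partial a_\nu} + [\xi(\cdots)]^\nu \equiv 0,$$

where the last term comes from differentiating the first factor in the product $a_\nu \xi^\nu$.
The last identity can be written as
$$p'(\xi) \frac{\partial \xi}{\partial a_\nu} + \xi^\nu = 0.$$

Since $p'(\xi) \ne 0$, we can solve for $\partial \xi / \partial a_\nu$ and insert the result in (1.53) and
(1.54) to obtain

$$(\operatorname{cond} \xi)(a) = \frac{1}{|\xi\, p'(\xi)|} \sum_{\nu=0}^{n-1} |a_\nu| \, |\xi|^\nu. \tag{1.55}$$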


We illustrate (1.55) by considering the polynomial $p$ of degree $n$ that has the
zeros $1, 2, \ldots, n$,

$$p(x) = \prod_{\nu=1}^n (x - \nu) = x^n + a_{n-1} x^{n-1} + \cdots + a_0. \tag{1.56}$$

This is a famous example due to J.H. Wilkinson, who discovered the ill-conditioning
of some of the zeros almost by accident. If we let $\xi_\mu = \mu$,
$\mu = 1, 2, \ldots, n$, it can be shown that

$$\min_\mu \operatorname{cond} \xi_\mu = \operatorname{cond} \xi_1 \sim n^2 \quad \text{as } n \to \infty,$$

$$\max_\mu \operatorname{cond} \xi_\mu \sim \frac{1}{(2 - \sqrt{2})\, \pi n} \left( \frac{\sqrt{2} + 1}{\sqrt{2} - 1} \right)^n \quad \text{as } n \to \infty.$$

The worst-conditioned root is $\xi_{\mu_0}$, with $\mu_0$ the integer closest to $n / \sqrt{2}$, when $n$ is
large. Its condition number grows like $(5.828\ldots)^n$, thus exponentially fast in $n$.
For example, when $n = 20$, then $\operatorname{cond} \xi_{\mu_0} = 0.540 \times 10^{14}$.

The example teaches us that the roots of an algebraic equation written in the 
form (1.50) can be extremely sensitive to small changes in the coefficients $a_\nu$. It
would, therefore, be ill-advised to express every polynomial in terms of powers, 
as in (1.56) and (1.50). This is particularly true for characteristic polynomials of 
matrices. It is much better here to work with the matrices themselves and try to 
reduce them (by similarity transformations) to a form that allows the eigenvalues 
— the roots of the characteristic equation — to be read off relatively easily. 
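Formula (1.55) can be evaluated exactly for Wilkinson's polynomial, since the coefficients are integers and $p'(\mu) = \prod_{\nu \ne \mu} (\mu - \nu)$ at the root $\xi_\mu = \mu$. The Python sketch below uses exact rational arithmetic so that no rounding enters; the function name is hypothetical.

```python
from fractions import Fraction
from math import prod

def wilkinson_root_conditions(n):
    """Return {mu: cond xi_mu} from (1.55) for p(x) = (x-1)(x-2)...(x-n)."""
    # Coefficients of p, lowest degree first: a[0], ..., a[n] with a[n] = 1.
    a = [1]
    for v in range(1, n + 1):
        new = [0] * (len(a) + 1)
        for i, c in enumerate(a):          # multiply the polynomial by (x - v)
            new[i + 1] += c
            new[i] -= v * c
        a = new
    conds = {}
    for mu in range(1, n + 1):
        dp = prod(mu - v for v in range(1, n + 1) if v != mu)   # p'(mu)
        s = sum(abs(a[v]) * mu ** v for v in range(n))          # sum |a_v| |xi|^v
        conds[mu] = Fraction(s, abs(mu * dp))
    return conds

conds = wilkinson_root_conditions(20)
worst = max(conds, key=conds.get)
print(worst, float(conds[worst]))   # worst-conditioned root and its condition number
print(float(conds[1]))              # best-conditioned root
```

For $n = 20$ the worst-conditioned root comes out as $\mu_0 = 14$, the integer closest to $20/\sqrt{2}$, in agreement with the asymptotic statement.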

3. Systems of linear algebraic equations: given a nonsingular square matrix $A \in \mathbb{R}^{n \times n}$,
and a vector $b \in \mathbb{R}^n$, the problem now discussed is solving the system

$$Ax = b. \tag{1.57}$$

Here the data are the elements of $A$ and $b$, and the result the vector $x$. The map in
question is thus $\mathbb{R}^{n^2 + n} \to \mathbb{R}^n$. To simplify matters, let us assume that $A$ is a fixed
matrix not subject to change, and only the vector $b$ is undergoing perturbations.
We then have a map $f : \mathbb{R}^n \to \mathbb{R}^n$ given by

$$x = f(b) := A^{-1} b.$$
It is in fact a linear map. Therefore, $\partial f / \partial b = A^{-1}$, and we get, using (1.35),

$$(\operatorname{cond} f)(b) = \frac{\|b\| \, \|A^{-1}\|}{\|A^{-1} b\|}, \tag{1.58}$$

where we may take any vector norm in $\mathbb{R}^n$ and associated matrix norm
(cf. (1.30)). We can write (1.58) alternatively in the form

$$(\operatorname{cond} f)(b) = \frac{\|Ax\| \, \|A^{-1}\|}{\|x\|} \quad (\text{where } Ax = b),$$

and since there is a one-to-one correspondence between $x$ and $b$, we find for the
worst condition number

$$\max_{b \in \mathbb{R}^n,\ b \ne 0} (\operatorname{cond} f)(b) = \max_{x \in \mathbb{R}^n,\ x \ne 0} \frac{\|Ax\|}{\|x\|} \cdot \|A^{-1}\| = \|A\| \cdot \|A^{-1}\|,$$

by definition of the norm of $A$. The number on the far right no longer depends on
the particular system (i.e., on $b$) and is called the condition number of the matrix
$A$. We denote it by

$$\operatorname{cond} A := \|A\| \cdot \|A^{-1}\|. \tag{1.59}$$


It should be clearly understood, though, that it measures the condition of a linear 
system with coefficient matrix A, and not the condition of other quantities that 
may depend on A, such as eigenvalues. 

Although we have considered only perturbations in the right-hand vector $b$, it
turns out that the condition number in (1.59) is also relevant when perturbations
in the matrix $A$ are allowed, provided they are sufficiently small (so small, for
example, that $\|\Delta A\| \cdot \|A^{-1}\| < 1$).

We illustrate (1.59) by several examples. 


(a) Hilbert² matrix:

$$H_n = \begin{bmatrix} 1 & \frac{1}{2} & \cdots & \frac{1}{n} \\ \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n+1} \\ \vdots & \vdots & & \vdots \\ \frac{1}{n} & \frac{1}{n+1} & \cdots & \frac{1}{2n-1} \end{bmatrix} \in \mathbb{R}^{n \times n}. \tag{1.60}$$

² David Hilbert (1862-1943) was the most prominent member of the Göttingen school of
mathematics. Hilbert’s fundamental contributions to almost all parts of mathematics (algebra, number
theory, geometry, integral equations, calculus of variations, and foundations) and in particular the
23 now famous problems he proposed in 1900 at the International Congress of Mathematicians
in Paris gave a new impetus, and new directions, to 20th-century mathematics. Hilbert is also
known for his work in mathematical physics, where among other things he formulated a variation
principle for Einstein’s equations in the theory of relativity.

This is clearly a symmetric matrix, and it is also positive definite. Some
numerical values for the condition number of $H_n$, computed with the
Euclidean norm,³ are shown in Table 1.1. Their rapid growth is devastating.

Table 1.1 The condition of Hilbert matrices

   n     $\operatorname{cond}_2 H_n$
  10     $1.60 \times 10^{13}$
  20     $2.45 \times 10^{28}$
  40     $7.65 \times 10^{58}$

A system of order $n = 10$, for example, cannot be solved with any reliability
in single precision on a 14-decimal computer. Double precision will be
“exhausted” by the time we reach $n = 20$. The Hilbert matrix thus is a
prototype of an ill-conditioned matrix. From a result of G. Szegő it can be
seen that

$$\operatorname{cond}_2 H_n \sim \frac{(\sqrt{2} + 1)^{4n+4}}{2^{15/4} \sqrt{\pi n}} \quad \text{as } n \to \infty.$$

³ We have $\operatorname{cond}_2 H_n = \lambda_{\max}(H_n) \cdot \lambda_{\max}(H_n^{-1})$, where $\lambda_{\max}(A)$ denotes the largest eigenvalue of
the (symmetric, positive definite) matrix $A$. The eigenvalues of $H_n$ and $H_n^{-1}$ are easily computed
by the Matlab routine eig, provided that the inverse of $H_n$ is computed directly from well-known
formulae (not by inversion); see MA 9.

⁴ Alexandre Théophile Vandermonde (1735-1796), a musician by training, but through acquaintance
with Fontaine turned mathematician (temporarily), and even elected to the French Academy
of Sciences, produced a total of four mathematical papers within 3 years (1770-1772). Though
written by a novice to mathematics, they are not without interest. The first, e.g., made notable
contributions to the then emerging theory of equations. By virtue of his fourth paper, he is
regarded as the founder of the theory of determinants. What today is referred to as “Vandermonde
determinant,” however, does not appear anywhere in his writings. As a member of the Academy,
he sat in a committee (together with Lagrange, among others) that was to define the unit of length,
the meter. Later in his life, he became an ardent supporter of the French revolution.

(b) Vandermonde⁴ matrices: these are matrices of the form

$$V_n = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ t_1 & t_2 & \cdots & t_n \\ \vdots & \vdots & & \vdots \\ t_1^{n-1} & t_2^{n-1} & \cdots & t_n^{n-1} \end{bmatrix} \in \mathbb{R}^{n \times n}, \tag{1.61}$$
where $t_1, t_2, \ldots, t_n$ are parameters, here assumed real. The condition number
of these matrices, in the $\infty$-norm, has been studied at length. Here are some
sample results: if the parameters are equally spaced in $[-1, 1]$, that is,

$$t_\nu = 1 - \frac{2(\nu - 1)}{n - 1}, \quad \nu = 1, 2, \ldots, n,$$

then
$$\operatorname{cond}_\infty V_n \sim \frac{1}{\pi}\, e^{-\pi/4}\, e^{n \left( \frac{\pi}{4} + \frac{1}{2} \ln 2 \right)}, \quad n \to \infty.$$
Numerical values are shown in Table 1.2. They are not growing quite as
fast as those for the Hilbert matrix, but still exponentially fast.

Table 1.2 The condition of Vandermonde matrices

   n     $\operatorname{cond}_\infty V_n$
  10     $1.36 \times 10^{4}$
  20     $1.05 \times 10^{9}$
  40     $6.93 \times 10^{18}$
  80     $3.15 \times 10^{38}$

Worse than exponential growth is observed if one takes harmonic numbers as parameters,

$$t_\nu = \frac{1}{\nu}, \quad \nu = 1, 2, \ldots, n.$$

Then indeed

$$\operatorname{cond}_\infty V_n > n^{n+1}.$$


Fortunately, there are not many matrices occurring naturally in applications 
that are that ill-conditioned, but moderately to severely ill-conditioned 
matrices are no rarity in real-life applications. 


1.4 The Condition of an Algorithm 


We again assume that we are dealing with a problem f given by 


$$f : \mathbb{R}^m \to \mathbb{R}^n, \quad y = f(x). \tag{1.62}$$
Along with the problem $f$, we are also given an algorithm $A$ that “solves” the
problem. That is, given a machine vector $x \in \mathbb{R}^m(t, s)$, the algorithm $A$ produces
a vector $y_A$ (in machine arithmetic) that is supposed to approximate $y = f(x)$.
Thus, we have another map $f_A$ describing how the problem $f$ is solved by the
algorithm $A$,

$$f_A : \mathbb{R}^m(t, s) \to \mathbb{R}^n(t, s), \quad y_A = f_A(x). \tag{1.63}$$


In order to be able to analyze $f_A$ in these general terms, we must make a basic
assumption, namely, that

$$\text{for every } x \in \mathbb{R}^m(t, s), \text{ there holds } f_A(x) = f(x_A) \text{ for some } x_A \in \mathbb{R}^m. \tag{1.64}$$


That is, the computed solution corresponding to some input $x$ is the exact solution
for some different input $x_A$ (not necessarily a machine vector and not necessarily
uniquely determined) that we hope is close to $x$. The closer we can find an $x_A$ to
$x$, the more confidence we should place in the algorithm $A$. We therefore define
the condition of $A$ in terms of the $x_A$ closest to $x$ (if there is more than one), by
comparing its relative error with the machine precision eps:

$$(\operatorname{cond} A)(x) = \inf_{x_A} \frac{\|x_A - x\|}{\|x\|} \Big/ \mathrm{eps}. \tag{1.65}$$

Here the infimum is over all $x_A$ satisfying $y_A = f(x_A)$. In practice, one can take
any such $x_A$ and then obtain an upper bound for the condition number:

$$(\operatorname{cond} A)(x) \le \frac{\|x_A - x\|}{\|x\|} \Big/ \mathrm{eps}. \tag{1.66}$$


The vector norm in (1.65), respectively, (1.66), can be chosen as seems convenient. 
Here are some very elementary examples. 


1. Suppose a library routine for the logarithm function $y = \ln x$, for any positive
machine number $x$, produces a $y_A$ satisfying $y_A = [\ln x](1 + \varepsilon)$, $|\varepsilon| \le 5\, \mathrm{eps}$.
What can we say about the condition of the underlying algorithm $A$? We clearly
have

$$y_A = \ln x_A, \quad \text{where } x_A = x^{1 + \varepsilon} \text{ (uniquely)}.$$

Consequently,

$$\left| \frac{x_A - x}{x} \right| = |x^\varepsilon - 1| = |e^{\varepsilon \ln x} - 1| \approx |\varepsilon \ln x| \le 5\, |\ln x| \cdot \mathrm{eps},$$

and, therefore, $(\operatorname{cond} A)(x) \le 5 |\ln x|$. The algorithm $A$ is well conditioned,
except in the immediate right-hand vicinity of $x = 0$ and for $x$ very large. (In
the latter case, however, $x$ is likely to overflow before $A$ becomes seriously
ill-conditioned.)
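2. Consider the problem
$$f : \mathbb{R}^n \to \mathbb{R}, \quad y = x_1 x_2 \cdots x_n.$$

We solve the problem by the obvious algorithm

$$A: \quad p_1 = x_1, \qquad p_k = \mathrm{fl}(x_k\, p_{k-1}), \quad k = 2, 3, \ldots, n, \qquad y_A = p_n.$$
Note that $x_1$ is machine representable, since for the algorithm $A$ we assume
$x \in \mathbb{R}^n(t, s)$.

Now using the basic law of machine arithmetic (cf. (1.15)), we get

$$p_1 = x_1, \qquad p_k = x_k\, p_{k-1} (1 + \varepsilon_k), \quad k = 2, 3, \ldots, n, \quad |\varepsilon_k| \le \mathrm{eps},$$

from which

$$p_n = x_1 x_2 \cdots x_n (1 + \varepsilon_2)(1 + \varepsilon_3) \cdots (1 + \varepsilon_n).$$
Therefore, we can take for example (there is no uniqueness),
$$x_A = [x_1, x_2 (1 + \varepsilon_2), \ldots, x_n (1 + \varepsilon_n)]^T.$$
This gives, using the $\infty$-norm,

$$\frac{\|x_A - x\|_\infty}{\|x\|_\infty\, \mathrm{eps}} = \frac{\|[0, x_2 \varepsilon_2, \ldots, x_n \varepsilon_n]^T\|_\infty}{\|x\|_\infty\, \mathrm{eps}} \le \frac{\|x\|_\infty\, \mathrm{eps}}{\|x\|_\infty\, \mathrm{eps}} = 1,$$

and so, by (1.66), $(\operatorname{cond} A)(x) \le 1$ for any $x \in \mathbb{R}^n(t, s)$. Our algorithm, to nobody’s
surprise, is perfectly well conditioned.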
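The backward-error argument for the product algorithm can be replayed numerically. The Python sketch below is an illustration under assumptions: it takes IEEE double precision with round-to-nearest as the machine arithmetic, uses the unit roundoff $2^{-53}$ in the role of eps, requires nonzero data, and ignores overflow and underflow. The rounding errors $\varepsilon_k$ are recovered exactly with rational arithmetic; all names are hypothetical.

```python
from fractions import Fraction

U = Fraction(1, 2 ** 53)   # unit roundoff of IEEE double precision (assumed arithmetic)

def product_backward_error(x):
    """Run p_k = fl(x_k * p_{k-1}) and build x_A = [x_1, x_2(1+e_2), ..., x_n(1+e_n)].

    Returns (p_n, x_A, bound) where bound = ||x_A - x||_inf / (||x||_inf * U),
    an upper bound for (cond A)(x) in the sense of (1.66).
    """
    p = x[0]
    x_A = [Fraction(x[0])]
    for xk in x[1:]:
        exact = Fraction(xk) * Fraction(p)        # x_k * p_{k-1} without rounding
        p_new = xk * p                            # machine product fl(x_k * p_{k-1})
        eps_k = Fraction(p_new) / exact - 1       # exact relative rounding error
        x_A.append(Fraction(xk) * (1 + eps_k))
        p = p_new
    num = max(abs(a - Fraction(b)) for a, b in zip(x_A, x))
    den = max(abs(Fraction(b)) for b in x)
    return p, x_A, num / den / U

p, x_A, bound = product_backward_error([0.1, 0.7, 1.3, 2.9, 0.003])
print(p, float(bound))   # the bound should not exceed 1
```

By construction the exact product of the components of `x_A` equals the computed `p`, which is the content of assumption (1.64) for this algorithm.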
1.5 Computer Solution of a Problem; Overall Error


The problem to be solved is again
$$f : \mathbb{R}^m \to \mathbb{R}^n, \quad y = f(x). \tag{1.67}$$


This is the mathematical (idealized) problem, where the data are exact real numbers, 
and the solution is the mathematically exact solution. 

When solving such a problem on a computer, in floating-point arithmetic with
precision eps, and using some algorithm $A$, one first of all rounds the data, and then
applies to these rounded data not $f$, but $f_A$:

$$x^* = \text{rounded data}, \quad \frac{\|x^* - x\|}{\|x\|} = \varepsilon, \qquad y_A^* = f_A(x^*). \tag{1.68}$$

Here $\varepsilon$ represents the rounding error in the data. (The error $\varepsilon$ could also be due to
sources other than rounding, e.g., measurement.) The total error that we wish to
estimate is then
$$\frac{\|y_A^* - y\|}{\|y\|}. \tag{1.69}$$

By the basic assumption (1.64) made on the algorithm $A$, and choosing $x_A^*$
optimally, we have

$$f_A(x^*) = f(x_A^*), \quad \frac{\|x_A^* - x^*\|}{\|x^*\|} = (\operatorname{cond} A)(x^*) \cdot \mathrm{eps}. \tag{1.70}$$
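Let $y^* = f(x^*)$. Then, using the triangle inequality, we have

$$\frac{\|y_A^* - y\|}{\|y\|} \le \frac{\|y_A^* - y^*\|}{\|y\|} + \frac{\|y^* - y\|}{\|y\|} \approx \frac{\|y_A^* - y^*\|}{\|y^*\|} + \frac{\|y^* - y\|}{\|y\|},$$

where we have used the (harmless) approximation $\|y\| \approx \|y^*\|$. By virtue of (1.70),
we now have for the first term on the right,

$$\frac{\|y_A^* - y^*\|}{\|y^*\|} = \frac{\|f_A(x^*) - f(x^*)\|}{\|f(x^*)\|} = \frac{\|f(x_A^*) - f(x^*)\|}{\|f(x^*)\|} \le (\operatorname{cond} f)(x^*) \cdot \frac{\|x_A^* - x^*\|}{\|x^*\|} = (\operatorname{cond} f)(x^*) \cdot (\operatorname{cond} A)(x^*) \cdot \mathrm{eps}.$$

For the second term we have

$$\frac{\|y^* - y\|}{\|y\|} = \frac{\|f(x^*) - f(x)\|}{\|f(x)\|} \le (\operatorname{cond} f)(x) \cdot \frac{\|x^* - x\|}{\|x\|} = (\operatorname{cond} f)(x) \cdot \varepsilon.$$

Assuming finally that $(\operatorname{cond} f)(x^*) \approx (\operatorname{cond} f)(x)$, we get

$$\frac{\|y_A^* - y\|}{\|y\|} \le (\operatorname{cond} f)(x) \{ \varepsilon + (\operatorname{cond} A)(x^*) \cdot \mathrm{eps} \}. \tag{1.71}$$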


This shows how the data error and machine precision contribute toward the total 
error: both are amplified by the condition of the problem, but the latter is further 
amplified by the condition of the algorithm. 


1.6 Notes to Chapter 1 


In addition to rounding errors in the data and those committed during the execution 
of arithmetic operations, there may be other sources of errors not considered in this 
introductory chapter. One such source of error, which is not entirely dismissible, is 
a faulty design of the computer chip that executes arithmetic operations. This was 
brought home in an incident several years ago, when it was discovered (by Thomas 
Nicely in the course of number-theoretic computations involving reciprocals of 
twin primes) that the Pentium floating-point divide chip manufactured by Intel can 
produce erroneous results for certain (extremely rare) bit patterns in the divisor. The 
incident — rightly so — has stirred up considerable concern and prompted not only 
remedial actions but also careful analysis of the phenomenon; some relevant articles 
are those by Coe et al. [1995] and Edelman [1997]. 

Neither should the occurrence of overflow and proper handling thereof be taken 
lightly, especially not in real-time applications. Again, a case in point is the failure of 
the French rocket Ariane 5, which on June 4, 1996, less than a minute into its flight, 
self-destructed. The failure was eventually traced to an overflow in a floating-point 
to integer conversion and lack of protection against this occurrence in the rocket’s 
on-board software (cf. Anonymous [1996]). 

Many of the topics covered in this chapter, but also the effect of finite precision 
computation on convergence and stability of mathematical processes, and issues of 
error analyses are dealt with in Chaitin-Chatelin and Frayssé [1996]. 


Section 1.1.1. The abstract notion of the real number system is discussed in most
texts on real analysis, for example, Hewitt and Stromberg [1975, Chap. 1, Sect. 1.5] 
or Rudin [1976, Chap. 1]. The development of the concept of real (and complex) 
numbers has had a long and lively history, extending from pre-Hellenic times to the 
recent past. Many of the leading thinkers over time contributed to this development. 
A reader interested in a detailed historical account (and who knows German) is 
referred to the monograph by Gericke [1970]. 
Section 1.1.2.1. The notion of the floating-point number system and associated
arithmetic, including interval arithmetic, can also be phrased in abstract algebraic 
terms; see, for example, Kulisch and Miranker [1981]. For a comprehensive treat- 
ment of computer arithmetic, including questions of validation, see Kulisch [2008]. 
A more elementary, but detailed, discussion of floating-point numbers and arith- 
metic is given in Sterbenz [1974]. There the reader will learn, for example, that 
computing the average of two floating-point numbers, or solving a quadratic 
equation, can be fairly intricate tasks if they are to be made foolproof. The 
quadratic equation problem is also considered at some length in Young and 
Gregory [1988, Sect.3.4], where further references are given to earlier work of 
W. Kahan and G. E. Forsythe. 

The basic standard for binary floating-point arithmetic, used on all contemporary
computers, is the ANSI/IEEE Standard 754 established in IEEE [1985]. It provides
for $t = 23$ bits in the mantissa and $s = 7$ bits in the exponent, in single-precision
arithmetic, and has $t = 52$, $s = 11$ in double precision. There is also
an “extended precision” for which $t = 63$, $s = 14$, allowing for a number range
of approx. $10^{-4964}$ to $10^{+4964}$. A good source for IEEE floating-point arithmetic is
Overton [2001].


Section 1.1.2.3. Rational arithmetic is available in all major symbolic computation 
packages such as Mathematica and Macsyma. 

Interval arithmetic has evolved to become an important tool in computations that 
strive at obtaining guaranteed and sharp inclusion regions for the results of mathe- 
matical problems. Basic texts on (real) interval analysis are Moore [1966], [1979], 
Alefeld and Herzberger [1983], and Moore et al. [2009], whereas complex interval 
arithmetic is treated in Petković and Petković [1998]. For the newly evolving field
of validated numerics we refer to Tucker [2011]. Specific applications such as 
computing inclusions of the range of functions, of global extrema of functions 
of one and several variables, and of solutions to systems of linear and nonlinear 
equations are studied, respectively, in Ratschek and Rokne [1984], [1988], Hansen 
and Walster [2004], and Neumaier [1990]. Other applications, e.g., to parameter 
estimation, robust control, and robotics can be found in Jaulin et al. [2001]. 
Concrete algorithms and codes (in Pascal and C++) for “verified computing” are 
contained in Hammer et al. [1993], [1995]. Interval arithmetic has been most widely 
used in processes involving finite-dimensional spaces; for applications to infinite- 
dimensional problems, notably differential equations, see, however, Eijgenraam 
[1981] and Kaucher and Miranker [1984]. For a recent expository account, see also 
Rump [2010]. 


Section 1.2. For floating-point arithmetic, see the handbook by Muller et al. [2010]. 
The fact that thoughtless use of mathematical formulae and numerical methods, or 
inherent sensitivities in a problem, can lead to disastrous results has been known 
since the early days of computers; see, for example, the old but still relevant 
papers by Stegun and Abramowitz [1956] and Forsythe [1970]. Nearby singularities
can also cause the accuracy to deteriorate unless corrective measures are taken; 
Forsythe [1958] has an interesting discussion of this. 


Section 1.2.1. For the implications of rounding in the problem of apportion- 
ment, mentioned in footnote 1, a good reference is Garfunkel and Steen, 
eds. [1988, Chap. 12, pp.230-249]. 


Section 1.3.1. An early but basic reference for ideas of conditioning and error anal- 
ysis in algebraic processes is Wilkinson [1994]. An impressive continuation of this 
work, containing copious references to the literature, is Higham [2002]. It analyzes 
the behavior in floating-point arithmetic of virtually all the algebraic processes 
in current use. Problems of conditioning specifically involving polynomials are 
discussed in Gautschi [1984]. The condition of general (differentiable) maps has 
been studied as early as 1966 in Rice [1966]. 


Section 1.3.2. 1. For a treatment of stability aspects of more general difference 
equations, and systems thereof, including nonlinear ones, the reader is referred to 
the monograph by Wimp [1984]. This also contains many applications to special 
functions. Other relevant texts are Lakshmikantham and Trigiante [2002] and 
Elaydi [2005]. 


2. The condition of algebraic equations, although considered already in 1963 by 
Wilkinson, has been further analyzed by Gautschi [1973]. The circumstances 
that led to Wilkinson’s example (1.56), which he himself describes as “the 
most traumatic experience in [his] career as a numerical analyst,” are related
in the essay Wilkinson [1984, Sect. 2]. This reference also deals with errors
committed in the evaluation and deflation of polynomials. For the latter, also 
see Cohen [1994]. The asymptotic estimates for the best- and worst-conditioned 
roots in Wilkinson’s example are from Gautschi [1973]. For the computation 
of eigenvalues of matrices, the classic treatment is Wilkinson [1988]; more 
recent accounts are Parlett [1998] for symmetric matrices and Golub and Van 
Loan [1996, Chap. 7-9] for general matrices. 

3. A more complete analysis of the condition of linear systems that also allows for 
perturbations of the matrix can be found, for example, in the very readable books 
by Forsythe and Moler [1967, Chap. 8] and Stewart [1973, Chap. 4, Sect. 3]. The 
asymptotic result of Szegő cited in connection with the Euclidean condition
number of the Hilbert matrix is taken from Szegő [1936]. For the explicit inverse
of the Hilbert matrix, referred to in footnote 3, see Todd [1954]. The condition
of Vandermonde and Vandermonde-like matrices has been studied in a series of
papers by the author; for a summary, see Gautschi [1990], and Gautschi [2011b]
for optimally scaled and optimally conditioned Vandermonde and Vandermonde- 
like matrices. 


Sections 1.4 and 1.5. The treatment of the condition of algorithms and of the overall 
error in computer solutions of problems, as given in these sections, seems to be more 
or less original. Similar ideas, however, can be found in the book by Dahlquist and 
Björck [2008, Sect. 2.4].

Exercises and Machine Assignments to Chapter 1


Exercises 


1. Represent all elements of $\mathbb{R}_+(3, 2) = \{x \in \mathbb{R}(3, 2) : x > 0,\ x \text{ normalized}\}$ as
dots on the real axis. For clarity, draw two axes, one from 0 to 8, the other from
0 to $\frac{1}{2}$.
2. (a) What is the distance $d(x)$ of a positive normalized floating-point number
$x \in \mathbb{R}(t, s)$ to its next larger floating-point number:

$$d(x) = \min_{y \in \mathbb{R}(t, s),\ y > x} (y - x)?$$

(b) Determine the relative distance $r(x) = d(x)/x$, with $x$ as in (a), and give
upper and lower bounds for it.
3. The identity $\mathrm{fl}(1 + x) = 1$, $x \ge 0$, is true for $x = 0$ and for $x$ sufficiently small.
What is the largest machine number $x$ for which the identity still holds?
4. Consider a miniature binary computer whose floating-point words consist of 
four binary digits for the mantissa and three binary digits for the exponent (plus 
sign bits). Let 


$$x = (0.1011)_2 \times 2^0, \quad y = (0.1100)_2 \times 2^0.$$


Mark in the following table whether the machine operation indicated (with 
the result z assumed normalized) is exact, rounded (i.e., subject to a nonzero 
rounding error), overflows, or underflows. 


Operation Exact Rounded Overflow Underflow 
z= fi(x-y) 

z= fi((y — x)") 

z= fi(x+ y) 


z= fi(y + (x/4)) 
z= fix + (y/4)) 


5. The Matlab “machine precision” eps is twice the unit roundoff ($2 \times 2^{-t}$,
$t = 53$; cf. Sect. 1.1.3). It can be computed by the following Matlab program
(attributed to CLEVE MOLER):

    %EI_5 Matlab machine precision
    a=4/3;
    b=a-1;
    c=b+b+b;
    eps0=abs(c-1)

Run the program and prove its validity.


6. Prove (1.12).
7. A set $S$ of real numbers is said to possess a metric if there is defined a
distance function $d(x, y)$ for any two elements $x, y \in S$ that has the following
properties:

(i) $d(x, y) \ge 0$ and $d(x, y) = 0$ if and only if $x = y$ (positive definiteness);
(ii) $d(x, y) = d(y, x)$ (symmetry);
(iii) $d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality).

Discuss which of the following error measures is, or is not, a distance function
on what set $S$ of real numbers:

(a) absolute error: $\mathrm{ae}(x, y) = |x - y|$;
(b) relative error: $\mathrm{re}(x, y) = \left| \frac{x - y}{x} \right|$;
(c) relative precision (F.W.J. Olver, 1978): $\mathrm{rp}(x, y) = |\, \ln|x| - \ln|y| \,|$.

If $y = x(1 + \varepsilon)$, show that $\mathrm{rp}(x, y) = O(\varepsilon)$ as $\varepsilon \to 0$.


8. Assume that $x_1^*$, $x_2^*$ are approximations to $x_1$, $x_2$ with relative errors $E_1$ and
$E_2$, respectively, and that $|E_i| \le E$, $i = 1, 2$. Assume further that $x_1 \ne x_2$.

(a) How small must $E$ (in dependence of $x_1$ and $x_2$) be to ensure
that $x_1^* \ne x_2^*$?
(b) Taking $\frac{1}{x_1^* - x_2^*}$ to approximate $\frac{1}{x_1 - x_2}$, obtain a bound on the relative error
committed, assuming (1) exact arithmetic; (2) machine arithmetic with
machine precision eps. (In both cases, neglect higher-order terms in $E_1$,
$E_2$, eps.)


9. Consider the quadratic equation $x^2 + px + q = 0$ with roots $x_1$, $x_2$. As seen in
the second Example of Sect. 1.2.2, the absolutely larger root must be computed
first, whereupon the other can be accurately obtained from $x_1 x_2 = q$. Suppose
one incorporates this idea in a program such as

x1=abs(p/2)+sqrt(p*p/4-q);
if p>0, x1=-x1; end
x2=q/x1;


Find two serious flaws with this program as a “general-purpose quadratic 
equation solver.” Take into consideration that the program will be executed in 
floating-point machine arithmetic. Be specific and support your arguments by 
examples, if necessary. 
10. Suppose, for $|x|$ small, one has an accurate value of $y = e^x - 1$ (obtained, e.g.,
by Taylor expansion). Use this value to compute accurately $\sinh x = \frac{1}{2}(e^x - e^{-x})$
for small $|x|$.


11. Let $f(x) = \sqrt{1 + x^2} - 1$.


(a) Explain the difficulty of computing f(x) for a small value of |x| and show 
how it can be circumvented. 

(b) Compute (cond f)(x) and discuss the conditioning of f(x) for small |x|. 

(c) How can the answers to (a) and (b) be reconciled? 


12. The $n$th power of some positive (machine) number $x$ can be computed

(i) either by repeated multiplication by $x$, or
(ii) as $x^n = e^{n \ln x}$.

In each case, derive bounds for the relative error due to machine arithmetic,
neglecting higher powers of the machine precision against the first power.
(Assume that exponentiation and taking logarithms both involve a relative error
$\varepsilon$ with $|\varepsilon| \le$ eps.) Based on these bounds, state a criterion (involving $x$ and $n$)
for (i) to be more accurate than (ii).


13. Let $f(x) = (1 - \cos x)/x$, $x \ne 0$.

(a) Show that direct evaluation of $f$ is inaccurate if $|x|$ is small; assume
fl(f(x)) = fl((1 − fl(cos x))/x), where fl(cos x) = $(1 + \varepsilon_c)\cos x$, and
estimate the relative error $\varepsilon_f$ of fl(f(x)) as $x \to 0$.

(b) A mathematically equivalent form of $f$ is $f(x) = \sin^2 x/(x(1 + \cos x))$.
Carry out a similar analysis as in (a), based on fl(f(x)) = fl([fl(sin x)]^2/
fl(x(1 + fl(cos x)))), assuming fl(cos x) = $(1 + \varepsilon_c)\cos x$, fl(sin x) =
$(1 + \varepsilon_s)\sin x$ and retaining only first-order terms in $\varepsilon_s$ and $\varepsilon_c$. Discuss
the result.

(c) Determine the condition of $f(x)$. Indicate for what values of $x$ (if any)
$f(x)$ is ill-conditioned. ($|x|$ is no longer small, necessarily.)


14. If $z = x + iy$, then $\sqrt{z} = \left(\frac{r+x}{2}\right)^{1/2} + i\left(\frac{r-x}{2}\right)^{1/2}$, where $r = (x^2 + y^2)^{1/2}$.
Alternatively, $\sqrt{z} = u + iv$, $u = \left(\frac{r+x}{2}\right)^{1/2}$, $v = y/2u$. Discuss the
computational merits of these two (mathematically equivalent) expressions.
Illustrate with $z = 4.5 + 0.025i$, using eight significant decimal places. {Hint:
you may assume $x > 0$ without restriction of generality. Why?}


15. Consider the numerical evaluation of

\[ f(t) = \sum_{n=0}^{\infty} \frac{1}{1 + n^4 (t-n)^2 (t-n-1)^2}, \]

say, for $t = 20$, and 7-digit accuracy. Discuss the danger involved.


16. Let $X_+$ be the largest positive machine representable number, and $X_-$ the
absolute value of the smallest negative one (so that $-X_- \le x \le X_+$ for any
machine number $x$). Determine, approximately, all intervals on $\mathbb{R}$ on which the
tangent function overflows.
17. (a) Use Matlab to determine the first value of the integer $n$ for which $n!$
overflows. {Hint: use Stirling’s formula for $n!$.}

(b) Do the same as (a), but for $x^n$, $x = 10, 20, \ldots, 100$.

(c) Discuss how $x^n e^{-x}/n!$ can be computed for large $x$ and $n$ without
unnecessarily incurring overflow. {Hint: use logarithms and an asymptotic
formula for $\ln n!$.}


18. Consider a decimal computer with three (decimal) digits in the floating-point
mantissa.

(a) Estimate the relative error committed in symmetric rounding.

(b) Let $x_1 = 0.982$, $x_2 = 0.984$ be two machine numbers. Calculate in
machine arithmetic the mean $m = \frac{1}{2}(x_1 + x_2)$. Is the computed number
between $x_1$ and $x_2$?

(c) Derive sufficient conditions for $x_1 \le \mathrm{fl}(m) \le x_2$ to hold, where $x_1$, $x_2$ are
two machine numbers with $0 < x_1 < x_2$.


19. For this problem, assume a binary computer with 12 bits in the floating-point
mantissa.

(a) What is the machine precision eps?

(b) Let $x = 6/7$ and $x^*$ be the correctly rounded machine approximation to $x$
(symmetric rounding). Exhibit $x$ and $x^*$ as binary numbers.

(c) Determine (exactly) the relative error $\varepsilon$ of $x^*$ as an approximation to $x$, and
calculate the ratio $|\varepsilon|$/eps.


20. The distributive law of algebra states that

\[ (a + b)c = ac + bc. \]

Discuss to what extent this is violated in machine arithmetic. Assume a
computer with machine precision eps and assume that $a$, $b$, $c$ are
machine-representable numbers.

(a) Let $y_1$ be the floating-point number obtained by evaluating $(a + b)c$ (as
written) in floating-point arithmetic, and let $y_1 = (a + b)c(1 + e_1)$. Estimate
$|e_1|$ in terms of eps (neglecting second-order terms in eps).

(b) Let $y_2$ be the floating-point number obtained by evaluating $ac + bc$ (as
written) in floating-point arithmetic, and let $y_2 = (a + b)c(1 + e_2)$. Estimate
$|e_2|$ (neglecting second-order terms in eps) in terms of eps (and $a$, $b$, and
$c$).

(c) Identify conditions (if any) under which one of the two $y_i$ is significantly
less accurate than the other.


21. Let $x_1, x_2, \ldots, x_n$, $n \ge 1$, be machine numbers. Their product can be computed
by the algorithm

\[ p_1 = x_1, \qquad p_k = \mathrm{fl}(x_k p_{k-1}), \quad k = 2, 3, \ldots, n. \]

(a) Find an upper bound for the relative error $(p_n - x_1 x_2 \cdots x_n)/(x_1 x_2 \cdots x_n)$
in terms of the machine precision eps and $n$.

(b) For any integer $r \ge 1$ not too large so as to satisfy $r \cdot \mathrm{eps} < \frac{1}{10}$, show that

\[ (1 + \mathrm{eps})^r - 1 < 1.06 \cdot r \cdot \mathrm{eps}. \]

Hence, for $n$ not too large, simplify the answer given in (a). {Hint: use the
binomial theorem.}


22. Analyze the error propagation in exponentiation, $x^\alpha$ ($x > 0$):

(a) assuming $x$ exact and $\alpha$ subject to a small relative error $\varepsilon_\alpha$;
(b) assuming $\alpha$ exact and $x$ subject to a small relative error $\varepsilon_x$.

Discuss the possibility of any serious loss of accuracy.

23. Indicate how you would accurately compute

\[ (x + y)^{1/4} - y^{1/4}, \qquad x > 0, \; y > 0. \]


24. (a) Let $a = 0.23371258 \times 10^{-4}$, $b = 0.33678429 \times 10^{2}$, $c = -0.33677811 \times 10^{2}$.
Assuming an 8-decimal-digit computer, determine the sum $s = a + b + c$
either as (1) fl(s) = fl(fl(a + b) + c) or as (2) fl(s) = fl(a + fl(b + c)).
Explain the discrepancy between the two answers.

(b) For arbitrary machine numbers $a$, $b$, $c$ on a computer with machine
precision eps, find a criterion on $a$, $b$, $c$ for the result of (2) in (a) to be
more accurate than the result of (1). {Hint: compare bounds on the relative
errors, neglecting higher-order terms in eps and assuming $a + b + c \ne 0$;
see also MA 7.}

25. Write the expression $a^2 - 2ab\cos\gamma + b^2$ ($a > 0$, $b > 0$) as the sum of two
positive terms to avoid cancellation errors. Illustrate the advantage gained in
the case $a = 16.5$, $b = 15.7$, $\gamma = 5^\circ$, using 3-decimal-digit arithmetic. Is the
method foolproof?

26. Determine the condition number for the following functions:

(a) $f(x) = \ln x$, $x > 0$;
(b) $f(x) = \cos x$, $|x| < \frac{1}{2}\pi$;
(c) $f(x) = \sin^{-1} x$, $|x| < 1$;
(d) $f(x) = \sin^{-1} \dfrac{x}{\sqrt{1 + x^2}}$.

Indicate the possibility of ill-conditioning.

27. Compute the condition number of the following functions, and discuss any
possible ill-conditioning:

(a) $f(x) = x^{1/n}$ ($x > 0$, $n > 0$ an integer);
(b) $f(x) = x - \sqrt{x^2 - 1}$ ($x > 1$);
(c) $f(x_1, x_2) = \sqrt{x_1^2 + x_2^2}$;
(d) $f(x_1, x_2) = x_1 + x_2$.
28. (a) Consider the composite function $h(t) = g(f(t))$. Express the condition of
$h$ in terms of the condition of $g$ and $f$. Be careful to state at which points
the various condition numbers are to be evaluated.

(b) Illustrate (a) with $h(t) = \dfrac{1 + \sin t}{1 - \sin t}$, $t = \frac{1}{4}\pi$.

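29. Show that $(\mathrm{cond}\, f \cdot g)(x) \le (\mathrm{cond}\, f)(x) + (\mathrm{cond}\, g)(x)$. What can be said
about $(\mathrm{cond}\, f/g)(x)$?

30. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be given by $y = x_1 + x_2$. Define $(\mathrm{cond}\, f)(x) =
(\mathrm{cond}_{11} f)(x) + (\mathrm{cond}_{12} f)(x)$, where $x = [x_1, x_2]^T$ (cf. (1.27)).

(a) Derive a formula for $\kappa(x_1, x_2) = (\mathrm{cond}\, f)(x)$.
(b) Show that $\kappa(x_1, x_2)$ as a function of $x_1$, $x_2$ is symmetric with respect to
both bisectors $b_1$ and $b_2$ (see figure).

[Figure: the $(x_1, x_2)$-plane with the two bisectors $b_1$ and $b_2$.]

(c) Determine the lines in $\mathbb{R}^2$ on which $\kappa(x_1, x_2) = c$, $c \ge 1$ a constant.
(Simplify the analysis by using symmetry; cf. part (b).)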


31. Let $\|\cdot\|$ be a vector norm in $\mathbb{R}^n$ and denote by the same symbol the associated
matrix norm. Show for arbitrary matrices $A, B \in \mathbb{R}^{n \times n}$ that

(a) $\|AB\| \le \|A\|\,\|B\|$;
(b) $\mathrm{cond}(AB) \le \mathrm{cond}\, A \cdot \mathrm{cond}\, B$.

32. Prove (1.32). {Hint: let $m_\infty = \max_\nu \sum_\mu |a_{\nu\mu}|$. Show that $\|A\|_\infty \le m_\infty$ as
well as $\|A\|_\infty \ge m_\infty$, the latter by taking a special vector $x$ in (1.30).}


33. Let the $L_1$ norm of a vector $y = [y_\lambda]$ be defined by $\|y\|_1 = \sum_\lambda |y_\lambda|$. For a
matrix $A \in \mathbb{R}^{n \times m}$, show that

\[ \|A\|_1 := \max_{x \in \mathbb{R}^m,\; x \ne 0} \frac{\|Ax\|_1}{\|x\|_1} = \max_\mu \sum_\nu |a_{\nu\mu}|, \]

that is, $\|A\|_1$ is the “maximum column sum.” {Hint: let $m_1 = \max_\mu \sum_\nu |a_{\nu\mu}|$.
Show that $\|A\|_1 \le m_1$ as well as $\|A\|_1 \ge m_1$, the latter by taking for $x$ in
(1.30) an appropriate coordinate vector.}

34. Let $a$, $q$ be linearly independent vectors in $\mathbb{R}^n$ of (Euclidean) length 1. Define
$b(\rho) \in \mathbb{R}^n$ as follows:

\[ b(\rho) = a - \rho q, \quad \rho \in \mathbb{R}. \]

Compute the condition number of the angle $\alpha(\rho)$ between $b(\rho)$ and $q$ at the
value $\rho = \rho_0 = q^T a$. (Then $b(\rho_0) \perp q$; see figure.) Discuss the answer.

[Figure: the vectors $a$, $q$, and $b(\rho_0)$.]

35. The area $\Delta$ of a triangle ABC is given by $\Delta = \frac{1}{2} ab \sin\gamma$ (see figure). Discuss
the numerical condition of $\Delta$.

[Figure: the triangle ABC.]


36. Define, for $x \ne 0$,

\[ f_n = f_n(x) = (-1)^n \frac{d^n}{dx^n}\left(\frac{e^{-x}}{x}\right), \quad n = 0, 1, 2, \ldots . \]

(a) Show that $\{f_n\}$ satisfies the recursion

\[ y_k = \frac{k}{x}\, y_{k-1} + \frac{e^{-x}}{x}, \quad k = 1, 2, 3, \ldots; \qquad y_0 = \frac{e^{-x}}{x}. \]

{Hint: differentiate $k$ times the identity $e^{-x} = x \cdot (e^{-x}/x)$.}
(b) Why do you expect the recursion in (a), without doing any analysis, to be
numerically stable if $x > 0$? How about $x < 0$?
(c) Support and discuss your answer to (b) by considering $y_n$ as a function of
$y_0$ (which for $y_0 = f_0(x)$ yields $f_n = f_n(x)$) and by showing that the
condition number of this function at $f_0$ is

\[ (\mathrm{cond}\, y_n)(f_0) = \frac{1}{|e_n(x)|}, \]

where $e_n(x) = 1 + x + x^2/2! + \cdots + x^n/n!$ is the $n$th partial sum of the
exponential series. {Hint: use Leibniz’s formula to evaluate $f_n(x)$.}


37. Consider the algebraic equation

\[ x^n + ax - 1 = 0, \quad a > 0, \; n \ge 2. \]

(a) Show that the equation has exactly one positive root $\xi(a)$.
(b) Obtain a formula for $(\mathrm{cond}\, \xi)(a)$.
(c) Obtain (good) upper and lower bounds for $(\mathrm{cond}\, \xi)(a)$.


38. Consider the algebraic equation

\[ x^n + x^{n-1} - a = 0, \quad a > 0, \; n \ge 2. \]

(a) Show that there is exactly one positive root $\xi(a)$.
(b) Show that $\xi(a)$ is well conditioned as a function of $a$. Indeed, prove

\[ (\mathrm{cond}\, \xi)(a) < \frac{1}{n-1}. \]


39. Consider Lambert’s equation

\[ x e^x = a \]

for real values of $x$ and $a$.

(a) Show graphically that the equation has exactly one root $\xi(a) \ge 0$ if $a \ge 0$,
exactly two roots $\xi_2(a) < \xi_1(a) < 0$ if $-1/e < a < 0$, a double root $-1$ if
$a = -1/e$, and no root if $a < -1/e$.

(b) Discuss the condition of $\xi(a)$, $\xi_1(a)$, $\xi_2(a)$ as $a$ varies in the respective
intervals.


40. Given the natural number $n$, let $\xi = \xi(a)$ be the unique positive root of
the equation $x^n = a e^{-x}$ ($a > 0$). Determine the condition of $\xi$ as a
function of $a$; simplify the answer as much as possible. In particular, show that
$(\mathrm{cond}\, \xi)(a) < 1/n$.

41. Let $f(x_1, x_2) = x_1 + x_2$ and consider the algorithm A given as follows:

\[ f_A : \mathbb{R}^2(t, s) \to \mathbb{R}(t, s), \qquad y_A = \mathrm{fl}(x_1 + x_2). \]

Estimate $\gamma(x_1, x_2) = (\mathrm{cond}\, A)(x)$, using any of the norms

\[ \|x\|_1 = |x_1| + |x_2|, \quad \|x\|_2 = \sqrt{x_1^2 + x_2^2}, \quad \|x\|_\infty = \max(|x_1|, |x_2|). \]

Discuss the answer in the light of the conditioning of $f$.
42. This problem deals with the function $f(x) = \sqrt{1 - x} - 1$, $-\infty < x < 1$.

(a) Compute the condition number $(\mathrm{cond}\, f)(x)$.

(b) Let A be the algorithm that evaluates $f(x)$ in floating-point arithmetic on
a computer with machine precision eps, given an (error-free) floating-point
number $x$. Let $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$ be the relative errors due, respectively, to the
subtraction in $1 - x$, to taking the square root, and to the final subtraction of
1. Assume $|\varepsilon_i| \le$ eps ($i = 1, 2, 3$). Letting $f_A(x)$ be the value of $f(x)$ so
computed, write $f_A(x) = f(x_A)$ and $x_A = x(1 + \varepsilon_A)$. Express $\varepsilon_A$ in terms
of $x$, $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$ (neglecting terms of higher order in the $\varepsilon_i$). Then determine
an upper bound for $|\varepsilon_A|$ in terms of $x$ and eps and finally an estimate of
$(\mathrm{cond}\, A)(x)$.

(c) Sketch a graph of $(\mathrm{cond}\, f)(x)$ (found in (a)) and a graph of the estimate
of $(\mathrm{cond}\, A)(x)$ (found in (b)) as functions of $x$ on $(-\infty, 1)$. Discuss your
results.


43. Consider the function $f(x) = 1 - e^{-x}$ on the interval $0 \le x \le 1$.

(a) Show that $(\mathrm{cond}\, f)(x) \le 1$ on $[0, 1]$.

(b) Let A be the algorithm that evaluates $f(x)$ for the machine number $x$ in
floating-point arithmetic (with machine precision eps). Assume that the
exponential routine returns a correctly rounded answer. Estimate $(\mathrm{cond}\, A)(x)$
for $0 \le x \le 1$, neglecting terms of $O(\mathrm{eps}^2)$. {Point of information:
$\ln(1 + \varepsilon) = \varepsilon + O(\varepsilon^2)$, $\varepsilon \to 0$.}

(c) Plot $(\mathrm{cond}\, f)(x)$ and your estimate of $(\mathrm{cond}\, A)(x)$ as functions of $x$ on
$[0, 1]$. Comment on the results.


44. (a) Suppose A is an algorithm that computes the (smooth) function $f(x)$ for a
given machine number $x$, producing $f_A(x) = f(x)(1 + \varepsilon_f)$, where $|\varepsilon_f| \le
\varphi(x)\,\mathrm{eps}$ (eps = machine precision). If $0 < (\mathrm{cond}\, f)(x) < \infty$, show that

\[ (\mathrm{cond}\, A)(x) \le \frac{\varphi(x)}{(\mathrm{cond}\, f)(x)} \]

if second-order terms in eps are neglected. {Hint: set $f_A(x) = f(x_A)$,
$x_A = x(1 + \varepsilon_A)$, and expand in powers of $\varepsilon_A$, keeping only the first.}

(b) Apply the result of (a) to $f(x) = \dfrac{1 - \cos x}{\sin x}$, $0 < x < \frac{1}{2}\pi$, when evaluated
as shown. (You may assume that $\cos x$ and $\sin x$ are computed within a
relative error of eps.) Discuss the answer.

(c) Do the same as (b), but for the (mathematically equivalent) function

\[ f(x) = \frac{\sin x}{1 + \cos x}, \quad 0 < x < \tfrac{1}{2}\pi. \]
Machine Assignments

1. Let $x = 1 + \pi/10^6$. Compute the $n$th power of $x$ for $n = 100{,}000, 200{,}000, \ldots,
1{,}000{,}000$ once in single and once in double Matlab precision. Let the two
results be $p_n$ and $dp_n$. Use the latter to determine the relative errors $r_n$ of the
former. Print $n$, $p_n$, $dp_n$, $r_n$, $r_n/(n \cdot \mathrm{eps0})$, where eps0 is the single-precision
eps. What should $x^n$ be, approximately, when $n = 1{,}000{,}000$? Comment on
the results.


2. Compute the derivative $dy/dx$ of the exponential function $y = e^x$ at $x = 0$
from the difference quotients $d(h) = (e^h - 1)/h$ with decreasing $h$. Use

(a) $h = h1 := 2^{-i}$, $i = 5:5:50$;
(b) $h = h2 := (2.2)^{-i}$, $i = 5:5:50$.

Print the quantities $i$, $h1$, $h2$, $d1 := d(h1)$, $d2 := d(h2)$, the first and two last
ones in f-format, the others in e-format. Explain what you observe.


3. Consider the following procedure for determining the limit $\lim_{h \to 0} (e^h - 1)/h$ on a
computer. Let

\[ d_n = \mathrm{fl}\left(\frac{e^{2^{-n}} - 1}{2^{-n}}\right) \quad \text{for } n = 0, 1, 2, \ldots \]

and accept as the machine limit the first value satisfying $d_n = d_{n-1}$
($n \ge 1$).

(a) Write and run a Matlab routine implementing the procedure.

(b) In $\mathbb{R}(t, s)$-floating-point arithmetic, with rounding by chopping, for what
value of $n$ will the correct limit be reached, assuming no underflow (of
$2^{-n}$) occurs? {Hint: use $e^h = 1 + h + \frac{1}{2}h^2 + \cdots$.} Compare with the
experiment made in (a).

(c) On what kind of computer (i.e., under what conditions on $s$ and $t$) will
underflow occur before the limit is reached?


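4. Euler’s constant $\gamma = 0.57721566490153286\ldots$ is defined as the limit

\[ \gamma = \lim_{n \to \infty} \gamma_n, \quad \text{where } \gamma_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} - \ln n. \]

Assuming that $\gamma - \gamma_n \sim c\,n^{-d}$, $n \to \infty$, for some constants $c$ and $d > 0$, try to
determine $c$ and $d$ experimentally on the computer.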


5. Letting $\Delta u_n = u_{n+1} - u_n$, one has the easy formula

\[ \sum_{n=1}^{N} \Delta u_n = u_{N+1} - u_1. \]

With $u_n = \ln(1 + n)$, compute each side (as it stands) for $N = 1{,}000 : 1{,}000 :
10{,}000$, the left-hand side in Matlab single precision and the right-hand side in
double precision. Print the relative discrepancy of the two results. Repeat with
$\sum_{n=1}^{N} u_n$: Compute the sum in single and double precision and compare the
results. Try to explain what you observe.


6. (a) Write a program to compute

\[ s_N = \sum_{n=1}^{N} \left[\frac{1}{n} - \frac{1}{n+1}\right] = \sum_{n=1}^{N} \frac{1}{n(n+1)}, \]

once using the first summation and once using the (mathematically equivalent)
second summation. For $N = 10^k$, $k = 1:7$, print the respective
absolute errors. Comment on the results.

(b) Write a program to compute

\[ p_N = \prod_{n=1}^{N} \frac{n}{n+1}. \]

For the same values of $N$ as in part (a), print the relative errors. Comment
on the results.

7. (a) Prove: based on best possible relative error bounds, the floating-point
addition fl(fl(x + y) + z) is more accurate than fl(x + fl(y + z)) if and
only if $|x + y| \le |y + z|$. As applications, formulate addition rules in the
cases

(a1) $0 < x < y < z$;
(a2) $x > 0$, $y < 0$, $z > 0$;
(a3) $x < 0$, $y > 0$, $z < 0$.

(b) Consider the $n$th partial sums of the series defining the zeta function $\zeta(s)$,
resp., eta function $\eta(s)$,

\[ z_n = \sum_{k=1}^{n} \frac{1}{k^s}, \qquad e_n = \sum_{k=1}^{n} \frac{(-1)^{k-1}}{k^s}. \]

For $s = 2, 11/3, 5, 7.2, 10$ and $n = 50, 100, 200, 500, 1000$, compute these
sums in Matlab single precision, once in forward direction and once in
backward direction, and compare the results with Matlab double-precision
evaluations. Interpret the results in the light of your answers to part (a),
especially (a2) and (a3).


8. Let $n = 10^6$ and

\[ s = 10^{11}\, n + \sum_{k=1}^{n} \ln k. \]

(a) Determine $s$ analytically and evaluate to 16 decimal digits.
(b) The following Matlab program computes $s$ in three different (but
mathematically equivalent) ways:

%MAI_8B
%
n=10^6; s0=10^11*n;
s1=s0;
for k=1:n
  s1=s1+log(k);
end
s2=0;
for k=1:n
  s2=s2+log(k);
end
s2=s2+s0;
i=1:n;
s3=s0+sum(log(i));
[s1 s2 s3]'

Run the program and discuss the results.


9. Write a Matlab program that computes the Euclidean condition number of the
Hilbert matrix $H_n$ following the prescription given in footnote 3 of the text.

(a) The inverse of the Hilbert matrix $H_n$ has elements

\[ \left(H_n^{-1}\right)_{ij} = (-1)^{i+j} (i + j - 1) \binom{n+i-1}{n-j} \binom{n+j-1}{n-i} \binom{i+j-2}{i-1}^2 \]

(cf. Note 3 to Sect. 1.3.2). Simplify the expression to avoid factorials of
large numbers. {Hint: express all binomial coefficients in terms of factorials
and simplify.}

(b) Implement in Matlab the formula obtained in (a) and reproduce Table 1.1 
of the text. 


10. The (symmetrically truncated) cardinal series of a function $f$ is defined by

\[ C_N(f, h)(x) = \sum_{k=-N}^{N} f(kh)\, \mathrm{sinc}\left(\frac{x - kh}{h}\right), \]

where $h > 0$ is the spacing of the data and the sinc function is defined by

\[ \mathrm{sinc}(u) = \begin{cases} \dfrac{\sin(\pi u)}{\pi u} & \text{if } u \ne 0, \\ 1 & \text{if } u = 0. \end{cases} \]

Under appropriate conditions, $C_N(f, h)(x)$ approximates $f(x)$ on $[-Nh, Nh]$.

(a) Show that

\[ C_N(f, h)(x) = \frac{h}{\pi} \sin\frac{\pi x}{h} \sum_{k=-N}^{N} \frac{(-1)^k}{x - kh}\, f(kh). \]

Since this requires the evaluation of only one value of the sine function,
it provides a more efficient way to evaluate the cardinal series than the
original definition.

(b) While the form of $C_N$ given in (a) may be more efficient, it is numerically
unstable when $x$ is near one of the abscissae $kh$. Why?

(c) Find a way to stabilize the formula in (a). {Hint: introduce the integer $k_0$
and the real number $t$ such that $x = (k_0 + t)h$, $|t| \le \frac{1}{2}$.}

(d) Write a program to compute $C_N(f, h)(x)$ according to the formula in (a)
and the one developed in (c) for $N = 100$, $h = 0.1$, $f(x) = x \exp(-x^2)$,
and $x = 0.55$, $x = 0.5 + 10^{-8}$, $x = 0.5 + 10^{-15}$. Print $C_N(f, h)(x)$, $f(x)$,
and the error $|C_N(f, h)(x) - f(x)|$ in either case. Compare the results.


11. In the theory of Fourier series, the numbers

\[ \lambda_n = \frac{1}{2n+1} + \frac{2}{\pi} \sum_{k=1}^{n} \frac{1}{k} \tan\frac{k\pi}{2n+1}, \]

known as Lebesgue constants, are of some importance.

(a) Show that the terms in the sum increase monotonically with $k$. How do the
terms for $k$ near $n$ behave when $n$ is large?

(b) Compute $\lambda_n$ for $n = 1, 10, 10^2, \ldots, 10^5$ in Matlab single and double
precision and compare the results. Do the same with $n$ replaced by $\lceil n/2 \rceil$.
Explain what you observe.


12. Sum the series

\[ \text{(a)} \;\; \sum_{n=0}^{\infty} (-1)^n/n!^2, \qquad \text{(b)} \;\; \sum_{n=0}^{\infty} 1/n!^2 \]

until there is no more change in the partial sums to within the machine
precision. Generate the terms recursively. Print the number of terms required
and the value of the sum. (Answers in terms of Bessel functions: (a) $J_0(2)$;
cf. Abramowitz and Stegun [1964, (9.1.18)] or Olver et al. [2010, (10.9.1)] and
(b) $I_0(2)$; cf. Abramowitz and Stegun [1964, (9.6.16)] or Olver et al. [2010,
(10.32.1)].)


13. (P.J. Davis, 1993) Consider the series $\sum_{k=1}^{\infty} \frac{1}{k^{3/2} + k^{1/2}}$. Try to compute the sum
to three correct decimal digits.


14. We know from calculus that

\[ \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e. \]

What is the “machine limit”? Explain.

15. Let $f(x) = (n + 1)x - 1$. The iteration

\[ x_k = f(x_{k-1}), \quad k = 1, 2, \ldots, K; \qquad x_0 = 1/n, \]

in exact arithmetic converges to the fixed point $1/n$ in one step (Why?). What
happens in machine arithmetic? Run a program with $n = 1:5$ and $K = 10:10:50$
and explain quantitatively what you observe.
16. Compute the integral $\int_0^1 e^x\,dx$ from Riemann sums with $n$ equal subintervals,
evaluating the integrand at the midpoint of each. Print the Riemann sums for
$n = 5{,}000 : 5{,}000 : 100{,}000$ (to 15 decimal digits after the decimal point),
together with absolute errors. Comment on the results.

17. Let $y_n = \int_0^1 t^n e^{-t}\,dt$, $n = 0, 1, 2, \ldots$.

(a) Use integration by parts to obtain a recurrence formula relating $y_k$ to $y_{k-1}$
for $k = 1, 2, 3, \ldots$, and determine the starting value $y_0$.

(b) Write and run a Matlab program that generates $y_0, y_1, \ldots, y_{20}$, using the
recurrence of (a), and print the results to 15 decimal digits after the decimal
point. Explain in detail (quantitatively, using mathematical analysis) what
is happening.

(c) Use the recursion of (a) in reverse order, starting (arbitrarily) with $y_N =
0$. Place into five consecutive columns of a $(21 \times 5)$ matrix Y the values
$y_0^{(N)}, y_1^{(N)}, \ldots, y_{20}^{(N)}$ thus obtained for $N = 22, 24, 26, 28, 30$. Determine
how much consecutive columns of Y differ from one another by printing

\[ e_i = \max |(Y(:, i+1) - Y(:, i))\,./\,Y(:, i+1)|, \quad i = 1, 2, 3, 4. \]

Print the last column Y(:, 5) of Y and explain why this represents accurately
the column vector of the desired quantities $y_0, y_1, \ldots, y_{20}$.


Selected Solutions to Exercises 


14. We may assume $x > 0$, since otherwise we could multiply $z$ by $-1$ and the
result by $-i$.

In the first expression for $\sqrt{z}$ there will be a large cancellation error in
the imaginary part when $|y|$ is very small, whereas in the second expression
all arithmetic operations are benign. Illustration: $z = 4.5 + 0.025i$ (in eight
significant digits)

\[ r = 4.5000694, \]
\[ v = \left(\frac{r - x}{2}\right)^{1/2} = 5.8906706 \times 10^{-3}, \]
\[ v = \frac{y}{2u} = 5.8925338 \times 10^{-3}. \]

The last five digits in the first evaluation of $v$ are in error!
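
The loss of digits can be reproduced with a short script. What follows is a minimal sketch, assuming Python's `decimal` module with the context precision set to 8 as a stand-in for the eight-digit arithmetic of the illustration; the variable names are chosen here for illustration only.

```python
import cmath
from decimal import Decimal, getcontext

# Emulate eight significant decimal digits (an assumption standing in for
# the hand computation; every operation below is rounded to 8 digits).
getcontext().prec = 8

x, y = Decimal("4.5"), Decimal("0.025")
r = (x * x + y * y).sqrt()

# First expression: imaginary part from (r - x)/2, which cancels badly.
v_first = ((r - x) / 2).sqrt()

# Second expression: u from the benign sum r + x, then v = y/(2u).
u = ((r + x) / 2).sqrt()
v_second = y / (2 * u)

# Double-precision reference value for comparison.
v_ref = cmath.sqrt(complex(4.5, 0.025)).imag

print("r        =", r)
print("v first  =", v_first)
print("v second =", v_second)
print("v ref    =", v_ref)
```

The first value of v agrees with the reference in only about three digits, while the second agrees to roughly the full working precision.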
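21. (a) We have

\[ p_1 = x_1, \]
\[ p_k = x_k p_{k-1}(1 + \varepsilon_k), \quad |\varepsilon_k| \le \mathrm{eps}, \quad k = 2, 3, \ldots, n. \]

Therefore,

\[ p_2 = x_1 x_2 (1 + \varepsilon_2), \]
\[ p_3 = x_1 x_2 x_3 (1 + \varepsilon_2)(1 + \varepsilon_3), \]
\[ \cdots \]
\[ p_n = x_1 x_2 \cdots x_n (1 + \varepsilon_2)(1 + \varepsilon_3) \cdots (1 + \varepsilon_n), \]

so that

\[ \frac{p_n - x_1 x_2 \cdots x_n}{x_1 x_2 \cdots x_n} = (1 + \varepsilon_2)(1 + \varepsilon_3) \cdots (1 + \varepsilon_n) - 1 =: E. \]

If $E \ge 0$, then $|E| \le (1 + \mathrm{eps})^{n-1} - 1$; otherwise, $|E| \le 1 - (1 - \mathrm{eps})^{n-1}$.
Since the first bound is larger than the second, we get

\[ |E| \le (1 + \mathrm{eps})^{n-1} - 1. \]

(b) Using the binomial theorem, one has

\[ (1 + \mathrm{eps})^r - 1 = \left(1 + \binom{r}{1}\mathrm{eps} + \binom{r}{2}\mathrm{eps}^2 + \cdots + \mathrm{eps}^r\right) - 1 \]
\[ = r \cdot \mathrm{eps}\left\{1 + \frac{r-1}{2!}\,\mathrm{eps} + \frac{(r-1)(r-2)}{3!}\,\mathrm{eps}^2 + \cdots + \frac{(r-1)(r-2)\cdots 1}{r!}\,\mathrm{eps}^{r-1}\right\}. \]

Since $r \cdot \mathrm{eps} < \frac{1}{10}$, one has also

\[ (r - k)\,\mathrm{eps} < \frac{1}{10}, \quad k = 1, 2, \ldots, r - 1, \]

and so

\[ (1 + \mathrm{eps})^r - 1 < r \cdot \mathrm{eps}\left\{1 + \frac{1}{2!}10^{-1} + \frac{1}{3!}10^{-2} + \cdots + \frac{1}{r!}10^{-(r-1)}\right\} \]
\[ < r \cdot \mathrm{eps} \cdot 10\left\{\frac{10^{-1}}{1!} + \frac{10^{-2}}{2!} + \frac{10^{-3}}{3!} + \cdots\right\} \]
\[ = r \cdot \mathrm{eps} \cdot 10\{e^{10^{-1}} - 1\} = 1.051709\ldots\, r \cdot \mathrm{eps} \]
\[ < 1.06 \cdot r \cdot \mathrm{eps}. \]

Hence, if $(n - 1)\,\mathrm{eps} < 1/10$, the result in (a) can be simplified to

\[ |E| < 1.06\,(n - 1)\,\mathrm{eps}. \]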
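
Both bounds can be checked in exact rational arithmetic. The sketch below is illustrative only: it assumes IEEE double precision with round-to-nearest, takes eps as the unit roundoff $2^{-53}$, and uses a helper name (`product_relative_error`) and test data that are not from the text.

```python
from fractions import Fraction

EPS = Fraction(1, 2**53)  # unit roundoff of IEEE double precision (assumed)

def product_relative_error(xs):
    """Return E = (p_n - x_1...x_n)/(x_1...x_n), computed exactly."""
    p = xs[0]
    exact = Fraction(xs[0])
    for xk in xs[1:]:
        p = xk * p                 # p_k = fl(x_k p_{k-1}) in double precision
        exact *= Fraction(xk)      # exact rational product of the same inputs
    return (Fraction(p) - exact) / exact

# Part (b): (1 + eps)^r - 1 < 1.06 r eps for a few r with r*eps < 1/10.
for r in (1, 10, 100, 1000):
    assert (1 + EPS) ** r - 1 < Fraction(106, 100) * r * EPS

# Part (a): the observed error stays within (1 + eps)^(n-1) - 1.
xs = [1.0 + k / 7.0 for k in range(1, 50)]   # arbitrary machine numbers
E = product_relative_error(xs)
n = len(xs)
bound = (1 + EPS) ** (n - 1) - 1
print(float(abs(E)), float(bound), float(Fraction(106, 100) * (n - 1) * EPS))
```

Converting each float to a `Fraction` is exact, so the printed error is the true relative error of the computed product and not an estimate.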


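34. We have

\[ \cos\alpha(\rho) = \frac{q^T b}{\|q\| \cdot \|b\|} = \frac{q^T(a - \rho q)}{[(a - \rho q)^T (a - \rho q)]^{1/2}} = \frac{\rho_0 - \rho}{(1 - 2\rho_0\rho + \rho^2)^{1/2}} =: R(\rho), \]

\[ \alpha(\rho) = \cos^{-1} R(\rho), \qquad R(\rho_0) = 0 \text{ if } |\rho_0| < 1, \]

\[ \alpha'(\rho) = -\frac{R'(\rho)}{\sqrt{1 - R^2(\rho)}} = -\frac{1}{\sqrt{1 - R^2(\rho)}} \cdot \frac{-(1 - 2\rho_0\rho + \rho^2)^{1/2} - (\rho_0 - \rho)\left[(1 - 2\rho_0\rho + \rho^2)^{1/2}\right]'}{1 - 2\rho_0\rho + \rho^2}. \]

For $\rho = \rho_0$, therefore, assuming $|\rho_0| < 1$, we get

\[ \alpha'(\rho_0) = \frac{(1 - \rho_0^2)^{1/2}}{1 - \rho_0^2} = \frac{1}{(1 - \rho_0^2)^{1/2}}, \]

\[ (\mathrm{cond}\,\alpha)(\rho_0) = \frac{|\rho_0|}{\frac{\pi}{2}(1 - \rho_0^2)^{1/2}} = \frac{2}{\pi}\,\frac{|\rho_0|}{(1 - \rho_0^2)^{1/2}}. \]

If $\rho_0 = 0$, i.e., $a$ is already orthogonal to $q$, hence $b = a$, then $(\mathrm{cond}\,\alpha)(\rho_0) =
0$, as expected. If $|\rho_0| \uparrow 1$, then $(\mathrm{cond}\,\alpha)(\rho_0) \uparrow \infty$, since in the limit, $a$ is
parallel to $q$ and cannot be orthogonalized. In practice, if $|\rho_0|$ is close to 1, the
problem of ill-conditioning can be overcome by a single, or possibly repeated,
reorthogonalization.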
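
The closed-form condition number can be compared with a finite-difference estimate. This is a rough sketch only; the function names and the step size `h` are illustrative assumptions, and the angle is evaluated from the scalar formula for $R(\rho)$ and not from explicit vectors.

```python
import math

def alpha(rho, rho0):
    """Angle between b(rho) = a - rho*q and q, for |a| = |q| = 1, q^T a = rho0."""
    R = (rho0 - rho) / math.sqrt(1.0 - 2.0 * rho0 * rho + rho * rho)
    return math.acos(R)

def cond_alpha_formula(rho0):
    """Closed form (2/pi) |rho0| / sqrt(1 - rho0^2)."""
    return (2.0 / math.pi) * abs(rho0) / math.sqrt(1.0 - rho0 * rho0)

def cond_alpha_numeric(rho0, h=1e-6):
    """|rho0 * alpha'(rho0) / alpha(rho0)| with a central difference for alpha'."""
    d = (alpha(rho0 + h, rho0) - alpha(rho0 - h, rho0)) / (2.0 * h)
    return abs(rho0 * d / alpha(rho0, rho0))

for rho0 in (0.0, 0.5, 0.9, 0.999):
    print(rho0, cond_alpha_formula(rho0), cond_alpha_numeric(rho0))
```

The two columns should agree closely, and both grow without bound as $|\rho_0|$ approaches 1.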

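44. (a) Following the Hint, we have

\[ f_A(x) = f(x)(1 + \varepsilon_f) = f(x_A) = f(x(1 + \varepsilon_A)) = f(x + x\varepsilon_A) = f(x) + x\varepsilon_A f'(x) + O(\varepsilon_A^2). \]

Neglecting the $O(\varepsilon_A^2)$ term, one gets

\[ x \varepsilon_A f'(x) = f(x)\,\varepsilon_f, \]

hence

\[ |\varepsilon_A| = \left|\frac{f(x)}{x f'(x)}\right| |\varepsilon_f| = \frac{|\varepsilon_f|}{(\mathrm{cond}\, f)(x)} \le \frac{\varphi(x)}{(\mathrm{cond}\, f)(x)}\,\mathrm{eps}, \]

which proves the assertion.

(b) One easily computes

\[ (\mathrm{cond}\, f)(x) = \frac{x}{\sin x}, \quad 0 < x < \frac{\pi}{2}. \]

Furthermore,

\[ f_A(x) = \frac{(1 - (\cos x)(1 + \varepsilon_1))(1 + \varepsilon_2)}{(\sin x)(1 + \varepsilon_3)}\,(1 + \varepsilon_4), \quad |\varepsilon_i| \le \mathrm{eps}, \]

where $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$, $\varepsilon_4$ are the relative errors committed, respectively, in evaluating
the cosine function, the difference in the numerator, the sine function, and the
quotient. Neglecting terms of $O(\varepsilon_i^2)$, one obtains by a simple computation that

\[ f_A(x) = \frac{1 - \cos x}{\sin x}\left\{1 + \varepsilon_2 + \varepsilon_4 - \varepsilon_3 - \varepsilon_1 \frac{\cos x}{1 - \cos x}\right\}, \]

that is,

\[ |\varepsilon_f| = \left|\varepsilon_2 + \varepsilon_4 - \varepsilon_3 - \varepsilon_1 \frac{\cos x}{1 - \cos x}\right| \le \left(3 + \frac{\cos x}{1 - \cos x}\right)\mathrm{eps}. \]

Therefore,

\[ \varphi(x) = 3 + \frac{\cos x}{1 - \cos x}, \]

and one gets

\[ (\mathrm{cond}\, A)(x) \le \frac{\sin x}{x}\left(3 + \frac{\cos x}{1 - \cos x}\right), \quad 0 < x < \frac{\pi}{2}. \]

Obviously, $(\mathrm{cond}\, A)(x) \to \infty$ as $x \to 0$, whereas $(\mathrm{cond}\, A)(x) \to 6/\pi$ as
$x \to \pi/2$. The algorithm is ill-conditioned near $x = 0$ (cancellation error), but
well conditioned near $\pi/2$. The function itself is quite well conditioned,

\[ 1 < (\mathrm{cond}\, f)(x) < \frac{\pi}{2}. \]

(c) In this case (the condition of $f$ being of course the same),

\[ f_A(x) = \frac{(\sin x)(1 + \varepsilon_1)}{(1 + (\cos x)(1 + \varepsilon_2))(1 + \varepsilon_3)}\,(1 + \varepsilon_4), \quad |\varepsilon_i| \le \mathrm{eps}, \]

giving

\[ f_A(x) = \frac{\sin x}{1 + \cos x}\left\{1 + \varepsilon_1 + \varepsilon_4 - \varepsilon_3 - \varepsilon_2 \frac{\cos x}{1 + \cos x}\right\}, \]

that is,

\[ |\varepsilon_f| = \left|\varepsilon_1 + \varepsilon_4 - \varepsilon_3 - \varepsilon_2 \frac{\cos x}{1 + \cos x}\right| \le \left(3 + \frac{\cos x}{1 + \cos x}\right)\mathrm{eps}, \]

and

\[ (\mathrm{cond}\, A)(x) \le \frac{\sin x}{x}\left(3 + \frac{\cos x}{1 + \cos x}\right). \]

Now, A is entirely well conditioned,

\[ \frac{6}{\pi} \le (\mathrm{cond}\, A)(x) \le \frac{7}{2}, \quad 0 < x < \frac{\pi}{2}. \]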
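
The contrast between the two algorithms is easy to observe in double precision. The sketch below is illustrative, not part of the solution: the function names are invented here, and $\tan(x/2)$ (the half-angle identity for $f$) is assumed as the reference value.

```python
import math

def f_direct(x):
    """(1 - cos x)/sin x, the form analyzed in (b)."""
    return (1.0 - math.cos(x)) / math.sin(x)

def f_stable(x):
    """sin x/(1 + cos x), the form analyzed in (c)."""
    return math.sin(x) / (1.0 + math.cos(x))

def bound_c(x):
    """Bound on (cond A)(x) obtained in (c)."""
    return math.sin(x) / x * (3.0 + math.cos(x) / (1.0 + math.cos(x)))

for x in (1e-9, 1e-4, 0.5, 1.5):
    ref = math.tan(0.5 * x)            # f(x) = tan(x/2), used as reference
    err_b = abs(f_direct(x) - ref) / ref
    err_c = abs(f_stable(x) - ref) / ref
    print(x, err_b, err_c, bound_c(x))
```

For the smallest argument the direct form loses every digit, because the computed cosine rounds to 1, while the second form stays at the level of the machine precision.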
Selected Solutions to Machine Assignments

7. (a) For arbitrary real $x$, $y$, $z$, the first addition can be written as

\[ \mathrm{fl}(\mathrm{fl}(x + y) + z) = ((x + y)(1 + \varepsilon_1) + z)(1 + \varepsilon_2) \]
\[ = x + y + z + (x + y)\varepsilon_1 + (x + y + z)\varepsilon_2 + (x + y)\varepsilon_1\varepsilon_2 \]
\[ \approx (x + y + z)\left(1 + \frac{x + y}{x + y + z}\,\varepsilon_1 + \varepsilon_2\right), \]

where the $\varepsilon_i$ are bounded in absolute value by eps. The best bound for the
relative error is

\[ |\text{rel.err.}| \le \left(\frac{|x + y|}{|x + y + z|} + 1\right)\mathrm{eps}. \]

Likewise, for the second addition, there holds (interchange $x$ and $z$)

\[ |\text{rel.err.}| \le \left(\frac{|y + z|}{|x + y + z|} + 1\right)\mathrm{eps}. \]

Based on these two error bounds, the first addition is more accurate than
the second if and only if $|x + y| \le |y + z|$, as claimed.

Examples

(a1) $0 < x < y < z$. Here,

\[ |x + y| = x + y < y + z = |y + z|. \]

Thus, addition in increasing order is more accurate.
(a2) $x > 0$, $y < 0$, $z > 0$. Here,

\[ |x + y| = |x - |y||, \qquad |y + z| = |z - |y||, \]

and adding to the negative number $y$ the positive number closer to $|y|$
first is more accurate.
(a3) $x < 0$, $y > 0$, $z < 0$. Here,

\[ |x + y| = |-|x| + y| = ||x| - y|, \]
\[ |y + z| = |y - |z|| = ||z| - y|, \]

and adding to the positive number $y$ the negative number first whose
modulus is closer to $y$ is more accurate.
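
The rule can be tried on a concrete triple. The following is a small sketch under the assumption of IEEE double precision; the helper names and the sample values are illustrative and not taken from the text, and exact rational arithmetic (`fractions.Fraction`) supplies the true sum.

```python
from fractions import Fraction

def relative_errors(x, y, z):
    """Exact relative errors of fl(fl(x+y)+z) and fl(x+fl(y+z))."""
    exact = Fraction(x) + Fraction(y) + Fraction(z)
    first = (x + y) + z
    second = x + (y + z)
    err1 = abs((Fraction(first) - exact) / exact)
    err2 = abs((Fraction(second) - exact) / exact)
    return float(err1), float(err2)

def first_order_is_preferred(x, y, z):
    """Criterion of part (a): |x + y| <= |y + z|, evaluated exactly."""
    return abs(Fraction(x) + Fraction(y)) <= abs(Fraction(y) + Fraction(z))

# Case (a2): x > 0, y < 0, z > 0, with x the positive number closer to |y|.
x, y, z = 1e16, -1e16, 0.5
print(first_order_is_preferred(x, y, z), relative_errors(x, y, z))
```

Note that the criterion compares error bounds, so for a particular triple the actual errors need not be ordered the same way; in this example they are, and the difference is extreme.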


(b) PROGRAM

%MAI_7B
%
f0='%6.4f %8.1e %9.2e %9.2e %9.2e %9.2e\n';
disp('                      zeta                 eta')
disp('   s       n      forw     backw     forw     backw')
for s=[2 11/3 5 7.2 10]
  for n=[50 100 200 500 1000]
    k=1:n;
    z=sum(1./k.^s);               % double-precision reference, zeta
    e=sum((-1).^(k-1)./k.^s);     % double-precision reference, eta
    zf=single(0); ef=single(0);
    for kf=1:n                    % forward summation in single precision
      zf=zf+single(1/kf^s);
      ef=ef+single((-1)^(kf-1)/kf^s);
    end
    zf0=zf; ef0=ef;
    zb=single(0); eb=single(0);
    for kb=n:-1:1                 % backward summation in single precision
      zb=zb+single(1/kb^s);
      eb=eb+single((-1)^(kb-1)/kb^s);
    end
    zb0=zb; eb0=eb;
    errzf=abs((zf0-z)/z);
    errzb=abs((zb0-z)/z);
    erref=abs((ef0-e)/e);
    erreb=abs((eb0-e)/e);
    fprintf(f0,s,n,errzf,errzb,erref,erreb)
  end
  fprintf('\n')
end

OUTPUT
>> MAI_7B
                      zeta                 eta
   s       n      forw     backw     forw     backw
 2.0000  5.0e+01  1.14e-07  3.30e-08  5.16e-08  2.09e-08
 2.0000  1.0e+02  7.11e-08  1.82e-09  4.35e-08  4.35e-08
 2.0000  2.0e+02  9.34e-08  5.20e-08  1.16e-07  2.94e-08
 2.0000  5.0e+02  4.52e-08  4.52e-08  3.92e-07  3.01e-08
 2.0000  1.0e+03  1.70e-07  4.77e-08  3.85e-07  2.23e-08

 3.6667  5.0e+01  2.48e-07  3.30e-08  2.84e-08  3.54e-08
 3.6667  1.0e+02  1.26e-07  1.82e-08  5.97e-08  4.07e-09
 3.6667  2.0e+02  1.43e-06  3.15e-08  8.21e-08  1.84e-08
 3.6667  5.0e+02  1.65e-06  4.06e-08  8.40e-08  2.02e-08
 3.6667  1.0e+03  1.67e-06  4.88e-08  8.41e-08  2.03e-08

 5.0000  5.0e+01  2.46e-07  1.61e-08  4.04e-08  2.09e-08
 5.0000  1.0e+02  2.81e-07  5.08e-08  3.89e-08  2.24e-08
 5.0000  2.0e+02  2.83e-07  5.30e-08  3.88e-08  2.25e-08
 5.0000  5.0e+02  2.83e-07  5.31e-08  3.88e-08  2.25e-08
 5.0000  1.0e+03  2.83e-07  5.31e-08  3.88e-08  2.25e-08

 7.2000  5.0e+01  7.20e-09  7.20e-09  5.45e-08  5.52e-09
 7.2000  1.0e+02  7.20e-09  7.20e-09  5.45e-08  5.52e-09
 7.2000  2.0e+02  7.20e-09  7.20e-09  5.45e-08  5.52e-09
 7.2000  5.0e+02  7.20e-09  7.20e-09  5.45e-08  5.52e-09
 7.2000  1.0e+03  7.20e-09  7.20e-09  5.45e-08  5.52e-09

10.0000  5.0e+01  1.20e-08  1.20e-08  2.32e-08  2.32e-08
10.0000  1.0e+02  1.20e-08  1.20e-08  2.32e-08  2.32e-08
10.0000  2.0e+02  1.20e-08  1.20e-08  2.32e-08  2.32e-08
10.0000  5.0e+02  1.20e-08  1.20e-08  2.32e-08  2.32e-08
10.0000  1.0e+03  1.20e-08  1.20e-08  2.32e-08  2.32e-08
>>


Interpretation

Rather consistently, backward summation gives more accurate results than
forward summation, by as much as two decimal orders (for $s = 11/3$ and
$n = 200$ in the case of $\zeta(s)$). For large $s$ (for example $s = 10$) there is no
noticeable difference in accuracy. All this (and more) is consistent with the
answers given in part (a). Indeed, the series for the zeta function has terms that
are all positive and strictly decreasing, so that by (a1) summation in increasing
order of the terms, i.e., backward summation, is more accurate. In the case of
the eta series, consider

\[ x = \frac{1}{k^s}, \quad y = -\frac{1}{(k+1)^s}, \quad z = \frac{1}{(k+2)^s}. \]

Then

\[ |x + y| = \frac{1}{k^s}\left(1 - \left(\frac{k}{k+1}\right)^s\right), \]
\[ |y + z| = \frac{1}{(k+1)^s}\left(1 - \left(\frac{k+1}{k+2}\right)^s\right). \]

Since the function $x/(x + 1)$ for $x > 0$ increases monotonically, it follows that
$|x + y| > |y + z|$ for all $k > 0$, and backward summation is more accurate than
forward summation. Moreover,

\[ |x + y| = \frac{1}{k^s}\left[1 - \left(1 + \frac{1}{k}\right)^{-s}\right] \sim \frac{s}{k^{s+1}}, \quad k \to \infty, \]

and likewise for $|y + z|$, so that the improvement of backward over forward
summation disappears asymptotically as $k \to \infty$. Also, for large $s$, the
improvement is relatively small, even for only moderately large $k$.

The same discussion holds for

\[ x = -\frac{1}{k^s}, \quad y = \frac{1}{(k+1)^s}, \quad z = -\frac{1}{(k+2)^s}. \]
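11. (a) Let $x = \frac{k\pi}{2n+1}$, so that $0 < x < \frac{\pi}{2}$ if $1 \le k \le n$. Then, up to a positive
constant factor, the general term of the sum is

\[ f(x) = \frac{1}{x}\tan x. \]

We show that $f$ increases monotonically: we have

\[ [x f(x)]' = \frac{1}{\cos^2 x}, \]

hence

\[ x f'(x) = \frac{1}{\cos^2 x} - f(x) = \frac{1}{\cos^2 x} - \frac{\sin x}{x \cos x} \]
\[ = \frac{1}{\cos^2 x}\left(1 - \frac{1}{x}\sin x \cos x\right) \]
\[ = \frac{1}{\cos^2 x}\left(1 - \frac{\sin 2x}{2x}\right) > 0. \]

Thus, the terms in the sum increase monotonically. For $n$ very large, say
$n = 10^5$, most terms of the sum are negligibly small except for a few very
near $n$, where they sharply increase to the maximum value $\approx \frac{4}{\pi}$. This can
be seen by letting $k = n - r$ for some fixed (small) integer $r$ and large $n$.
Then

\[ \frac{n - r}{2n + 1} = \frac{1}{2} - \frac{2r + 1}{2(2n + 1)}, \]

and, as $n \to \infty$,

\[ \tan\frac{(n - r)\pi}{2n + 1} = \tan\left(\frac{\pi}{2} - \frac{\pi}{2}\,\frac{2r + 1}{2n + 1}\right) = \frac{\cos\left(\frac{\pi}{2}\,\frac{2r+1}{2n+1}\right)}{\sin\left(\frac{\pi}{2}\,\frac{2r+1}{2n+1}\right)} \sim \frac{4n}{\pi}\,\frac{1}{2r + 1}. \]

Therefore,

\[ \frac{1}{n - r}\tan\frac{(n - r)\pi}{2n + 1} \sim \frac{4}{\pi}\,\frac{1}{2r + 1} \quad \text{as } n \to \infty. \]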
(b) PROGRAM

%MAI_11B
%
f0='%10.0f %12.8f %19.16f %12.4e\n';
disp('         n    Lebesgue     Lebesgue double      diff')
% disp('         n    truncated single and double Lebesgue')
for n=[1 10 100 1000 10000 100000]
  den0=single(1/(2*n+1));
  den=1/(2*n+1);
  k=1:n;
% k=1:ceil(n/2);                        % use instead of k=1:n for the truncated sum
  s0=sum(single(tan(k*pi*den0)./k));    % single precision
  s=sum(tan(k*pi*den)./k);              % double precision
  s0=den0+single(2*s0/pi);
  s=den+2*s/pi;
  diff=abs(s-s0);
  fprintf(f0,n,s0,s,diff)
end


OUTPUT

>> MAI_11B
         n   Lebesgue     Lebesgue double        diff
         1   1.43599117  1.4359911241769170   4.3845e-08
        10   2.22335672  2.2233569241536819   2.0037e-07
       100   3.13877344  3.1387800926548395   6.6513e-06
      1000   4.07023430  4.0701636043524356   7.0694e-05
     10000   5.00332785  5.0031838616314577   1.4398e-04
    100000   5.92677021  5.9363682125234796   9.5980e-03


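Comments

Because of the behavior of the terms in the sum, when $n$ is large, the
accuracy of the sum is largely determined by the accuracy of the terms
very near to $n$. But there, the argument of the tangent is very close to $\pi/2$.
Since

$$(\mathrm{cond}\,\tan)(x) = \frac{x\,(1 + \tan^2 x)}{\tan x}, \qquad 0 < x < \pi/2,$$

the tangent is very ill-conditioned for $x$ near $\pi/2$. In fact, for $\varepsilon > 0$
very small,

$$(\mathrm{cond}\,\tan)\left(\frac{\pi}{2} - \varepsilon\right) \sim \left(\frac{\pi}{2} - \varepsilon\right)\tan\left(\frac{\pi}{2} - \varepsilon\right)
= \left(\frac{\pi}{2} - \varepsilon\right)\frac{\cos\varepsilon}{\sin\varepsilon} \sim \frac{\pi}{2\varepsilon}.$$

Since $k = n$ corresponds to $\frac{\pi}{2} - \varepsilon = \frac{n\pi}{2n+1}$, that is, $\varepsilon = \frac{\pi}{2(2n+1)} \sim \frac{\pi}{4n}$ as $n \to \infty$,
one has

$$(\mathrm{cond}\,\tan)\left(\frac{\pi}{2} - \varepsilon\right) \sim \frac{\pi}{2\pi/(4n)} = 2n, \qquad n \to \infty.$$

So, for $n = 10^5$, for example, we must expect a loss of about five decimal
digits. This is confirmed by the numerical results shown above.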
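The estimate $(\mathrm{cond}\,\tan)(\frac{\pi}{2} - \varepsilon) \sim 2n$ is easy to check numerically. The following is a minimal sketch in Python rather than MATLAB; it is our illustration, not part of the program above, and the helper name `cond_tan` is an assumption.

```python
import math

def cond_tan(x):
    """Condition number (cond tan)(x) = x (1 + tan^2 x) / tan x, 0 < x < pi/2."""
    t = math.tan(x)
    return x * (1.0 + t * t) / t

for n in (10, 1000, 100000):
    x = n * math.pi / (2 * n + 1)      # argument of the tangent for k = n
    print(n, cond_tan(x), 2 * n)       # the last two columns should be close
```

The number of decimal digits lost is roughly $\log_{10}(2n)$, which is about five for $n = 10^5$.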

The inaccuracy observed cannot be ascribed merely to the large volume
of computation. In fact, if we extended the sum only to, say, $k = n/2$, we
would escape the ill-conditioning of the tangent and, even for $n = 10^5$,
would get more accurate answers. This is shown by the numerical results
below.

         n    truncated single and double Lebesgue
         1   1.43599117  1.4359911241769170   4.3845e-08
        10   0.57060230  0.5706023117647347   1.3982e-08
       100   0.54363489  0.5436349730685069   8.1558e-08
      1000   0.54078776  0.5407873971420362   3.5930e-07
     10000   0.54050153  0.5405010907553436   4.4418e-07
    100000   0.54047358  0.5404724445910691   1.1358e-06


Chapter 2 
Approximation and Interpolation 


The present chapter is basically concerned with the approximation of functions. The 
functions in question may be functions defined on a continuum — typically a finite 
interval — or functions defined only on a finite set of points. The first instance arises, 
for example, in the context of special functions (elementary or transcendental) that 
one wishes to evaluate as a part of a subroutine. Since any such evaluation must be 
reduced to a finite number of arithmetic operations, we must ultimately approximate 
the function by means of a polynomial or a rational function. The second instance 
is frequently encountered in the physical sciences when measurements are taken 
of a certain physical quantity as a function of some other physical quantity (such 
as time). In either case one wants to approximate the given function “as well as 
possible” in terms of other simpler functions. 

The general scheme of approximation can be described as follows. We are
given the function $f$ to be approximated, along with a class $\Phi$ of “approximating
functions” $\varphi$ and a “norm” $\|\cdot\|$ measuring the overall magnitude of functions. We
are looking for an approximation $\hat\varphi \in \Phi$ of $f$ such that

$$\|f - \hat\varphi\| \le \|f - \varphi\| \quad \text{for all } \varphi \in \Phi. \tag{2.1}$$

The function $\hat\varphi$ is called the best approximation to $f$ from the class $\Phi$, relative to
the norm $\|\cdot\|$.

The class $\Phi$ is called a (real) linear space if with any two functions $\varphi_1$,
$\varphi_2 \in \Phi$ it also contains $\varphi_1 + \varphi_2$ and $c\varphi_1$ for any $c \in \mathbb{R}$, hence also any (finite)
linear combination of functions $\varphi_i \in \Phi$. Given $n$ “basis functions” $\pi_j \in \Phi$,
$j = 1, 2, \ldots, n$, we can define a linear space of finite dimension $n$ by

$$\Phi = \Phi_n = \Big\{\varphi :\ \varphi(t) = \sum_{j=1}^{n} c_j \pi_j(t),\ c_j \in \mathbb{R}\Big\}. \tag{2.2}$$


Examples of linear spaces $\Phi$.

1. $\Phi = \mathbb{P}_m$: polynomials of degree $\le m$. A basis for
$\mathbb{P}_m$ is, for example, $\pi_j(t) = t^{j-1}$, $j = 1, 2, \ldots, m+1$, so that $n = m + 1$.
Polynomials are the most frequently used “general-purpose” approximants for
dealing with functions on bounded domains (finite intervals or finite sets of
points). One reason is Weierstrass’s theorem, which states that any continuous
function can be approximated on a finite interval as closely as one wishes by a
polynomial of sufficiently high degree.


2. $\Phi = \mathbb{S}_m^k(\Delta)$: (polynomial) spline functions of degree $m$ and smoothness class $k$
on the subdivision

$$\Delta:\quad a = t_1 < t_2 < t_3 < \cdots < t_{N-1} < t_N = b$$

of the interval $[a,b]$. These are piecewise polynomials of degree $\le m$ pieced
together at the “joints” $t_2, \ldots, t_{N-1}$ in such a way that all derivatives up to and
including the $k$th are continuous on the whole interval $[a,b]$, including the joints:

$$\mathbb{S}_m^k(\Delta) = \big\{ s \in C^k[a,b] :\ s\big|_{[t_i, t_{i+1}]} \in \mathbb{P}_m,\ i = 1, 2, \ldots, N-1 \big\}.$$

We assume here $0 \le k < m$; otherwise, we are back to polynomials $\mathbb{P}_m$ (see
Ex. 68). We set $k = -1$ if we allow discontinuities at the joints. The dimension
of $\mathbb{S}_m^k(\Delta)$ is $n = (m-k)\cdot(N-2) + m + 1$ (see Ex. 71), but to find a basis is a
nontrivial task; for $m = 1$, see Sect. 2.3.2.

3. $\Phi = \mathbb{T}_m[0, 2\pi]$: trigonometric polynomials of degree $\le m$ on $[0, 2\pi]$. These
are linear combinations of the basic harmonics up to and including the $m$th one,
that is,

$$\pi_k(t) = \cos (k-1)t, \quad k = 1, 2, \ldots, m+1;$$
$$\pi_{m+1+k}(t) = \sin kt, \quad k = 1, 2, \ldots, m,$$

where now $n = 2m + 1$. Such approximants are a natural choice when the
function $f$ to be approximated is periodic with period $2\pi$. (If $f$ has period $p$,
one makes a preliminary change of variables $t \mapsto t \cdot p/2\pi$.)

4. $\Phi = \mathbb{E}_n$: exponential sums. For given distinct $\alpha_j > 0$, one takes $\pi_j(t) = e^{-\alpha_j t}$,
$j = 1, 2, \ldots, n$. Exponential sums are often employed on the half-infinite
interval $\mathbb{R}_+$: $0 \le t < \infty$, especially if one knows that $f$ decays exponentially as
$t \to \infty$.

Note that the important class of rational functions,

$$\Phi = \mathbb{R}_{r,s} = \{\varphi :\ \varphi = p/q,\ p \in \mathbb{P}_r,\ q \in \mathbb{P}_s\},$$

is not a linear space. (Why not?)

Possible choices of norm, both for continuous and discrete functions,
and the type of approximation they generate are summarized in Table 2.1. The
continuous case involves an interval $[a,b]$ and a “weight function” $w(t)$ (possibly
$w(t) \equiv 1$) defined on $[a,b]$ and positive except for isolated zeros. The discrete case
involves a set of $N$ distinct points $t_1, t_2, \ldots, t_N$ along with positive weight factors
$w_1, w_2, \ldots, w_N$ (possibly all equal to 1). The interval $[a,b]$ may be unbounded if
the weight function $w$ is such that the integral extended over $[a,b]$, which defines
the norm, makes sense.

Table 2.1 Types of approximation and associated norms

| Continuous norm | Approximation | Discrete norm |
|---|---|---|
| $\|u\|_\infty = \max_{a \le t \le b} |u(t)|$ | $L_\infty$ (uniform, Chebyshev) | $\|u\|_\infty = \max_{1 \le i \le N} |u(t_i)|$ |
| $\|u\|_1 = \int_a^b |u(t)|\,dt$ | $L_1$ | $\|u\|_1 = \sum_{i=1}^N |u(t_i)|$ |
| $\|u\|_{1,w} = \int_a^b |u(t)|\,w(t)\,dt$ | Weighted $L_1$ | $\|u\|_{1,w} = \sum_{i=1}^N w_i |u(t_i)|$ |
| $\|u\|_{2,w} = \left(\int_a^b |u(t)|^2 w(t)\,dt\right)^{1/2}$ | Weighted $L_2$ (least squares) | $\|u\|_{2,w} = \left(\sum_{i=1}^N w_i |u(t_i)|^2\right)^{1/2}$ |

Hence, we may take any one of the norms in Table 2.1 and combine it with any of
the preceding linear spaces $\Phi$ to arrive at a meaningful best approximation problem
(2.1). In the continuous case, the given function $f$, and the functions $\varphi$ of the class
$\Phi$, of course, must be defined on $[a,b]$ and such that the norm $\|f - \varphi\|$ makes sense.
Likewise, $f$ and $\varphi$ must be defined at the points $t_i$ in the discrete case.

Note that if the best approximant $\hat\varphi$ in the discrete case is such that $\|f - \hat\varphi\| = 0$,
then $\hat\varphi(t_i) = f(t_i)$ for $i = 1, 2, \ldots, N$. We then say that $\hat\varphi$ interpolates $f$ at
the points $t_i$ and we refer to this kind of approximation problem as an interpolation
problem.

The simplest approximation problems are the least squares problem and the
interpolation problem, and the easiest space $\Phi$ to work with is the space of polynomials
of given degree. These are indeed the problems we concentrate on in this chapter.
In the case of the least squares problem, however, we admit general linear spaces
$\Phi$ of approximants, and also in the case of the interpolation problem, we include
polynomial splines in addition to straight polynomials.

Before we start with the least squares problem, we introduce a notational device
that allows us to treat the continuous and the discrete case simultaneously. We
define, in the continuous case,

$$\lambda(t) = \begin{cases}
0 & \text{if } t < a \ (\text{whenever } -\infty < a), \\[4pt]
\displaystyle\int_a^t w(\tau)\,d\tau & \text{if } a \le t \le b, \\[8pt]
\displaystyle\int_a^b w(\tau)\,d\tau & \text{if } t > b \ (\text{whenever } b < \infty).
\end{cases} \tag{2.3}$$

Then we can write, for any (say, continuous) function $u$,

$$\int_{\mathbb{R}} u(t)\,d\lambda(t) = \int_a^b u(t)\,w(t)\,dt, \tag{2.4}$$

since $d\lambda(t) \equiv 0$ “outside” $[a,b]$, and $d\lambda(t) = w(t)\,dt$ inside. We call $d\lambda$ a
continuous (positive) measure. The discrete measure (also called “Dirac measure”)
associated with the point set $\{t_1, t_2, \ldots, t_N\}$ is a measure $d\lambda$ that is nonzero only at
the points $t_i$ and has the value $w_i$ there. Thus, in this case,

$$\int_{\mathbb{R}} u(t)\,d\lambda(t) = \sum_{i=1}^{N} w_i\,u(t_i). \tag{2.5}$$

(A more precise definition can be given in terms of Stieltjes integrals, if we define
$\lambda(t)$ to be a step function having jump $w_i$ at $t_i$.) In particular, we can define the $L_2$
norm as

$$\|u\|_{2,d\lambda} = \left(\int_{\mathbb{R}} |u(t)|^2\,d\lambda(t)\right)^{1/2} \tag{2.6}$$

and obtain the continuous or the discrete norm depending on whether $\lambda$ is taken to
be as in (2.3), or a step function, as in (2.5).

We call the support of $d\lambda$, and denote it by $\operatorname{supp} d\lambda$, the interval $[a,b]$ in the
continuous case (assuming $w$ positive on $[a,b]$ except for isolated zeros), and the
set $\{t_1, t_2, \ldots, t_N\}$ in the discrete case. We say that the set of functions $\pi_j(t)$ in (2.2)
is linearly independent on the support of $d\lambda$ if

$$\sum_{j=1}^{n} c_j \pi_j(t) \equiv 0 \ \text{ for all } t \in \operatorname{supp} d\lambda \quad \text{implies} \quad c_1 = c_2 = \cdots = c_n = 0. \tag{2.7}$$

Example: the powers $\pi_j(t) = t^{j-1}$, $j = 1, 2, \ldots, n$.

Here $\sum_{j=1}^{n} c_j \pi_j(t) = p_{n-1}(t)$ is a polynomial of degree $\le n - 1$. Suppose, first,
that $\operatorname{supp} d\lambda = [a,b]$. Then the identity in (2.7) says that $p_{n-1}(t) \equiv 0$ on $[a,b]$.
Clearly, this implies $c_1 = c_2 = \cdots = c_n = 0$, so that the powers are linearly
independent on $\operatorname{supp} d\lambda = [a,b]$. If, on the other hand, $\operatorname{supp} d\lambda = \{t_1, t_2, \ldots, t_N\}$,
then the premise in (2.7) says that $p_{n-1}(t_i) = 0$, $i = 1, 2, \ldots, N$; that is, $p_{n-1}$ has
$N$ distinct zeros $t_i$. This implies $p_{n-1} \equiv 0$ only if $N \ge n$. Otherwise, $p_{n-1}(t) =
\prod_{i=1}^{N}(t - t_i) \in \mathbb{P}_{n-1}$ would satisfy $p_{n-1}(t_i) = 0$, $i = 1, 2, \ldots, N$, without being
identically zero. Thus, we have linear independence on $\operatorname{supp} d\lambda = \{t_1, t_2, \ldots, t_N\}$ if
and only if $N \ge n$.


2.1 Least Squares Approximation


We specialize the best approximation problem (2.1) by taking as norm the $L_2$ norm

$$\|u\|_{2,d\lambda} = \left(\int_{\mathbb{R}} |u(t)|^2\,d\lambda(t)\right)^{1/2}, \tag{2.8}$$

where $d\lambda$ is either a continuous measure (cf. (2.3)) or a discrete measure (cf. (2.5)),
and by using approximants $\varphi$ from an $n$-dimensional linear space

$$\Phi = \Phi_n = \Big\{\varphi :\ \varphi(t) = \sum_{j=1}^{n} c_j \pi_j(t),\ c_j \in \mathbb{R}\Big\}. \tag{2.9}$$

Here the basis functions $\pi_j$ are assumed linearly independent on $\operatorname{supp} d\lambda$ (cf. (2.7)).
We furthermore assume, of course, that the integral in (2.8) is meaningful whenever
$u = \pi_j$ or $u = f$, the given function to be approximated.

The solution of the least squares problem is most easily expressed in terms of
orthogonal systems $\pi_j$ relative to an appropriate inner product. We therefore begin
with a discussion of inner products.


2.1.1 Inner Products 


Given a continuous or discrete measure $d\lambda$, as introduced earlier, and given any two
functions $u$, $v$ having a finite norm (2.8), we can define the inner product

$$(u, v) = \int_{\mathbb{R}} u(t)\,v(t)\,d\lambda(t). \tag{2.10}$$

(Schwarz’s inequality $|(u,v)| \le \|u\|_{2,d\lambda} \cdot \|v\|_{2,d\lambda}$, cf. Ex. 6, tells us that the integral
in (2.10) is well defined.) The inner product (2.10) has the following obvious (but
useful) properties:

1. symmetry: $(u, v) = (v, u)$;
2. homogeneity: $(\alpha u, v) = \alpha (u, v)$, $\alpha \in \mathbb{R}$;
3. additivity: $(u + v, w) = (u, w) + (v, w)$; and
4. positive definiteness: $(u, u) \ge 0$, with equality holding if and only if $u \equiv 0$ on
   $\operatorname{supp} d\lambda$.

Homogeneity and additivity together give linearity,

$$(\alpha_1 u_1 + \alpha_2 u_2, v) = \alpha_1 (u_1, v) + \alpha_2 (u_2, v) \tag{2.11}$$

in the first variable and, by symmetry, also in the second. Moreover, (2.11) easily
extends to linear combinations of arbitrary finite length. Note also that

$$\|u\|_{2,d\lambda}^2 = (u, u). \tag{2.12}$$

We say that $u$ and $v$ are orthogonal if

$$(u, v) = 0. \tag{2.13}$$

This is always trivially true if either $u$ or $v$ vanishes identically on $\operatorname{supp} d\lambda$.
It is now a simple exercise, for example, to prove the Theorem of Pythagoras:

$$\text{if } (u, v) = 0, \text{ then } \|u + v\|^2 = \|u\|^2 + \|v\|^2, \tag{2.14}$$

where $\|\cdot\| = \|\cdot\|_{2,d\lambda}$. (From now on we use this abbreviated notation for the
norm.) Indeed,

$$\|u + v\|^2 = (u + v, u + v) = (u, u) + (u, v) + (v, u) + (v, v)
= \|u\|^2 + 2(u, v) + \|v\|^2 = \|u\|^2 + \|v\|^2,$$

where the first equality is a definition, the second follows from additivity, the
third from symmetry, and the last from orthogonality. Interpreting functions $u$, $v$
as “vectors,” we can picture the configuration of $u$, $v$ (orthogonal) and $u + v$ as in
Fig. 2.1.

Fig. 2.1 Orthogonal vectors and their sum
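For a discrete measure the inner product (2.10) is just a weighted finite sum, so the Theorem of Pythagoras can be checked directly. The following Python sketch is our illustration; the point set, the weights, and the helper name `inner` are assumptions chosen so that $u(t) = 1$ and $v(t) = t$ are orthogonal.

```python
def inner(u, v, t, w):
    """Discrete inner product (u, v) = sum_i w_i u(t_i) v(t_i), cf. (2.5), (2.10)."""
    return sum(wi * u(ti) * v(ti) for ti, wi in zip(t, w))

t = [-2.0, -1.0, 0.0, 1.0, 2.0]   # support of the discrete measure (symmetric points)
w = [1.0] * len(t)                # equal weights

u = lambda x: 1.0
v = lambda x: x
s = lambda x: u(x) + v(x)         # the sum u + v

uv = inner(u, v, t, w)            # vanishes because the points are symmetric
lhs = inner(s, s, t, w)           # ||u + v||^2
rhs = inner(u, u, t, w) + inner(v, v, t, w)   # ||u||^2 + ||v||^2
print(uv, lhs, rhs)
```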
More generally, we may consider an orthogonal system $\{u_k\}_{k=1}^{n}$:

$$(u_i, u_j) = 0 \ \text{ if } i \ne j, \qquad u_k \not\equiv 0 \ \text{ on } \operatorname{supp} d\lambda;
\qquad i, j = 1, 2, \ldots, n; \quad k = 1, 2, \ldots, n. \tag{2.15}$$

For such a system we have the Generalized Theorem of Pythagoras,

$$\Big\|\sum_{k=1}^{n} \alpha_k u_k\Big\|^2 = \sum_{k=1}^{n} |\alpha_k|^2\,\|u_k\|^2. \tag{2.16}$$

The proof is essentially the same as before. An important consequence of (2.16) is
that every orthogonal system is linearly independent on the support of $d\lambda$. Indeed,
if the left-hand side of (2.16) vanishes, then so does the right-hand side, and this,
since $\|u_k\|^2 > 0$ by assumption, implies $\alpha_1 = \alpha_2 = \cdots = \alpha_n = 0$.


2.1.2 The Normal Equations 


We are now in a position to solve the least squares approximation problem. By
(2.12), we can write the $L_2$ error, or rather its square, in the form:

$$E^2[\varphi] := \|\varphi - f\|^2 = (\varphi - f, \varphi - f) = (\varphi, \varphi) - 2(\varphi, f) + (f, f).$$

Inserting $\varphi$ here from (2.9) gives

$$E^2[\varphi] = \int_{\mathbb{R}} \Big(\sum_{j=1}^{n} c_j \pi_j(t)\Big)^2 d\lambda(t)
- 2\int_{\mathbb{R}} \Big(\sum_{j=1}^{n} c_j \pi_j(t)\Big) f(t)\,d\lambda(t)
+ \int_{\mathbb{R}} f^2(t)\,d\lambda(t). \tag{2.17}$$

The squared $L_2$ error, therefore, is a quadratic function of the coefficients $c_1$,
$c_2, \ldots, c_n$ of $\varphi$. The problem of best $L_2$ approximation thus amounts to minimizing
a quadratic function of $n$ variables. This is a standard problem of calculus and is
solved by setting all partial derivatives equal to zero. This yields a system of linear
algebraic equations. Indeed, differentiating partially with respect to $c_i$ under the
integral sign in (2.17) gives

$$\frac{\partial}{\partial c_i} E^2[\varphi] = 2\int_{\mathbb{R}} \Big(\sum_{j=1}^{n} c_j \pi_j(t)\Big) \pi_i(t)\,d\lambda(t)
- 2\int_{\mathbb{R}} \pi_i(t) f(t)\,d\lambda(t),$$

and setting this equal to zero, interchanging integration and summation in the
process, we get

$$\sum_{j=1}^{n} (\pi_i, \pi_j)\,c_j = (\pi_i, f), \qquad i = 1, 2, \ldots, n. \tag{2.18}$$

These are called the normal equations for the least squares problem. They form a
linear system of the form

$$A c = b, \tag{2.19}$$

where the matrix $A$ and the vector $b$ have elements

$$A = [a_{ij}], \quad a_{ij} = (\pi_i, \pi_j); \qquad b = [b_i], \quad b_i = (\pi_i, f). \tag{2.20}$$


By symmetry of the inner product, $A$ is a symmetric matrix. Moreover, $A$ is positive
definite; that is,

$$x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} x_i x_j > 0 \quad \text{if } x \ne [0, 0, \ldots, 0]^T. \tag{2.21}$$

The quadratic function in (2.21) is called a quadratic form (since it is homogeneous
of degree 2). Positive definiteness of $A$ thus says that the quadratic form whose
coefficients are the elements of $A$ is always nonnegative, and zero only if all
variables $x_i$ vanish.

To prove (2.21), all we have to do is insert the definition of the $a_{ij}$ and use the
elementary properties 1-4 of the inner product:

$$x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j (\pi_i, \pi_j)
= \sum_{i=1}^{n}\sum_{j=1}^{n} (x_i \pi_i, x_j \pi_j)
= \Big\|\sum_{i=1}^{n} x_i \pi_i\Big\|^2.$$

This clearly is nonnegative. It is zero only if $\sum_{i=1}^{n} x_i \pi_i \equiv 0$ on $\operatorname{supp} d\lambda$, which, by
the assumption of linear independence of the $\pi_i$, implies $x_1 = x_2 = \cdots = x_n = 0$.

Now it is a well-known fact of linear algebra that a symmetric positive definite
matrix $A$ is nonsingular. Indeed, its determinant, as well as all its leading principal
minor determinants, are strictly positive. It follows that the system (2.18) of normal
equations has a unique solution. Does this solution correspond to a minimum of
$E^2[\varphi]$ in (2.17)? Calculus tells us that for this to be the case, the Hessian matrix $H =
[\partial^2 E^2/\partial c_i \partial c_j]$ has to be positive definite. But $H = 2A$, since $E^2$ is a quadratic
function. Therefore, $H$, with $A$, is indeed positive definite, and the solution of the
normal equations gives us the desired minimum. The least squares approximation
problem thus has a unique solution, given by

$$\hat\varphi(t) = \sum_{j=1}^{n} \hat c_j \pi_j(t), \tag{2.22}$$

where $\hat c = [\hat c_1, \hat c_2, \ldots, \hat c_n]^T$ is the solution vector of the normal equations (2.18).
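To make (2.18)-(2.22) concrete, here is a minimal Python sketch for a discrete measure. It is our illustration only: the helper names `normal_equations` and `solve`, the point set, and the test function are assumptions, and the linear system is solved by plain Gaussian elimination with partial pivoting rather than by any particular library routine.

```python
def normal_equations(basis, f, t, w):
    """Assemble A = [(pi_i, pi_j)] and b = [(pi_i, f)] of (2.20) for a discrete measure."""
    ip = lambda u, v: sum(wi * u(ti) * v(ti) for ti, wi in zip(t, w))
    A = [[ip(pi, pj) for pj in basis] for pi in basis]
    b = [ip(pi, f) for pi in basis]
    return A, b

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]       # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]                        # row interchange
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

t = [0.0, 0.25, 0.5, 0.75, 1.0]                        # N = 5 points
w = [1.0] * len(t)                                     # equal weights
basis = [lambda x, j=j: x ** j for j in range(3)]      # powers 1, t, t^2 (n = 3 <= N)
f = lambda x: 1.0 - 2.0 * x + 3.0 * x * x              # f lies in the span of the basis

A, b = normal_equations(basis, f, t, w)
c = solve(A, b)
print(c)                                               # should reproduce the coefficients of f
```

Since $f$ itself belongs to $\Phi_3$ here, the best approximation is $f$, and the computed coefficients agree with those of $f$ up to rounding.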


This completely settles the least squares approximation problem in theory. How 
about in practice? 

Assuming a general set of (linearly independent) basis functions, we can see the
following possible difficulties.

1. The system (2.18) may be ill-conditioned. A simple example is provided by
   $\operatorname{supp} d\lambda = [0,1]$, $d\lambda(t) = dt$ on $[0,1]$, and $\pi_j(t) = t^{j-1}$, $j = 1, 2, \ldots, n$.
   Then

   $$(\pi_i, \pi_j) = \int_0^1 t^{i+j-2}\,dt = \frac{1}{i+j-1}, \qquad i, j = 1, 2, \ldots, n;$$

   that is, the matrix $A$ in (2.18) is precisely the Hilbert matrix (cf. Chap. 1, (1.60)).
   The resulting severe ill-conditioning of the normal equations in this example
   is entirely due to an unfortunate choice of basis functions, the powers. These
   become almost linearly dependent, more so the larger the exponent (cf. Ex. 38).
   Another source of degradation lies in the element $b_j = \int_0^1 \pi_j(t) f(t)\,dt$ of the
   right-hand vector $b$ in (2.18). When $j$ is large, the power $\pi_j = t^{j-1}$ behaves
   very much like a discontinuous function on $[0,1]$: it is practically zero for much
   of the interval until it shoots up to the value 1 at the right endpoint. This has
   the unfortunate consequence that a good deal of information about $f$ is lost
   when one forms the integral defining $b_j$. A polynomial $\pi_j$ that oscillates rapidly
   on $[0,1]$ would seem to be preferable from this point of view, since it would
   “engage” the function $f$ more vigorously over all of the interval $[0,1]$.

2. The second disadvantage is the fact that all coefficients $\hat c_j$ in (2.22) depend on
   $n$; that is, $\hat c_j = \hat c_j^{(n)}$, $j = 1, 2, \ldots, n$. Increasing $n$, for example, will give an
   enlarged system of normal equations with a completely new solution vector. We
   refer to this as the nonpermanence of the coefficients $\hat c_j$.
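The ill-conditioning described in item 1 is easy to observe. The Python sketch below is our illustration (the helper names `hilbert`, `solve`, and `max_error` are assumptions): it solves $Ac = b$ with the Hilbert matrix and a right-hand side constructed so that the exact solution is $c = [1, 1, \ldots, 1]^T$, and reports the largest error in the computed coefficients.

```python
def hilbert(n):
    """Hilbert matrix with elements 1/(i+j-1), i, j = 1..n."""
    return [[1.0 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def max_error(n):
    """Largest error in the computed solution when the exact one is (1, ..., 1)."""
    A = hilbert(n)
    b = [sum(row) for row in A]          # b = A * (1, ..., 1)^T, up to rounding
    c = solve(A, b)
    return max(abs(ci - 1.0) for ci in c)

for n in (3, 6, 9, 12):
    print(n, max_error(n))               # the error grows rapidly with n
```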

Both defects 1 and 2 can be eliminated (or at least attenuated in the case of 1)
in one stroke: select for the basis functions $\pi_j$ an orthogonal system,

$$(\pi_i, \pi_j) = 0 \ \text{ if } i \ne j; \qquad (\pi_j, \pi_j) = \|\pi_j\|^2 > 0. \tag{2.23}$$

Then the system of normal equations becomes diagonal and is solved immediately by

$$\hat c_j = \frac{(\pi_j, f)}{(\pi_j, \pi_j)}, \qquad j = 1, 2, \ldots, n. \tag{2.24}$$

Clearly, each of these coefficients $\hat c_j$ is independent of $n$, and once computed,
remains the same for any larger $n$. We now have permanence of the coefficients.
Also, we do not have to go through the trouble of solving a linear system of
equations, but instead can use the formula (2.24) directly. This does not mean that
there are no numerical problems associated with (2.24). Indeed, it is typical that the
denominators $\|\pi_j\|^2$ in (2.24) decrease rapidly with increasing $j$, whereas the
integrand in the numerator (or the individual terms in the case of a discrete inner
product) are of the same magnitude as $f$. Yet the coefficients $\hat c_j$ also are expected
to decrease rapidly. Therefore, cancellation errors must occur when one computes
the inner product in the numerator. The cancellation problem can be alleviated
somewhat by computing $\hat c_j$ in the alternative form

$$\hat c_j = \frac{1}{(\pi_j, \pi_j)} \Big(f - \sum_{k=1}^{j-1} \hat c_k \pi_k,\ \pi_j\Big), \qquad j = 1, 2, \ldots, n, \tag{2.25}$$

where the empty sum (when $j = 1$) is taken to be zero, as usual. Clearly, by
orthogonality of the $\pi_j$, (2.25) is equivalent to (2.24) mathematically, but not
necessarily numerically.


An algorithm for computing $\hat c_j$ from (2.25), and at the same time $\hat\varphi(t)$, is as
follows:

    s_0 = 0;
    for j = 1, 2, ..., n do
        ĉ_j = (1 / ‖π_j‖²) (f − s_{j−1}, π_j)
        s_j = s_{j−1} + ĉ_j π_j(t).

This produces the coefficients $\hat c_1, \hat c_2, \ldots, \hat c_n$ as well as $\hat\varphi(t) = s_n$.

Any system $\{\hat\pi_j\}$ that is linearly independent on the support of $d\lambda$ can be
orthogonalized (with respect to the measure $d\lambda$) by a device known as the
Gram¹–Schmidt² procedure. One takes

$$\pi_1 = \hat\pi_1$$

and, for $j = 2, 3, \ldots$, recursively forms

$$\pi_j = \hat\pi_j - \sum_{k=1}^{j-1} c_k \pi_k, \qquad c_k = \frac{(\hat\pi_j, \pi_k)}{(\pi_k, \pi_k)}.$$

Then each $\pi_j$ so determined is orthogonal to all preceding ones.
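The Gram–Schmidt procedure and the algorithm based on (2.25) can be sketched together for a discrete measure, where functions are represented by their values at the support points. The Python code below is our illustration under that assumption; the names `inner`, `gram_schmidt`, and `fourier_coefficients`, as well as the data, are hypothetical.

```python
def inner(u, v, w):
    """Discrete inner product of two vectors of function values, cf. (2.5), (2.10)."""
    return sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))

def gram_schmidt(P, w):
    """Orthogonalize the rows of P (values of the given functions at the points t_i)."""
    Q = []
    for p in P:
        q = p[:]
        for qk in Q:
            ck = inner(p, qk, w) / inner(qk, qk, w)    # c_k = (pihat_j, pi_k)/(pi_k, pi_k)
            q = [qi - ck * qki for qi, qki in zip(q, qk)]
        Q.append(q)
    return Q

def fourier_coefficients(f, Q, w):
    """Compute c_j = (f - s_{j-1}, pi_j)/||pi_j||^2 and s_j = s_{j-1} + c_j pi_j, cf. (2.25)."""
    s = [0.0] * len(f)                                 # s_0 = 0
    c = []
    for q in Q:
        r = [fi - si for fi, si in zip(f, s)]          # residual f - s_{j-1}
        cj = inner(r, q, w) / inner(q, q, w)
        c.append(cj)
        s = [si + cj * qi for si, qi in zip(s, q)]
    return c, s                                        # s holds the values of the approximant

t = [0.0, 0.25, 0.5, 0.75, 1.0]
w = [1.0] * len(t)
P = [[ti ** j for ti in t] for j in range(3)]          # values of the powers 1, t, t^2
Q = gram_schmidt(P, w)
f = [1.0 - 2.0 * ti + 3.0 * ti * ti for ti in t]       # f lies in the span of the powers
c, s = fourier_coefficients(f, Q, w)
print(c)
```

Because $f$ is a quadratic here, the values `s` of the least squares approximant agree with `f` up to rounding.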


¹Jørgen Pedersen Gram (1850–1916) was a farmer’s son who studied at the University of
Copenhagen. After graduation, he entered an insurance company as computer assistant and,
moving up the ranks, eventually became its director. He was interested in series expansions of
special functions and also contributed to Chebyshev and least squares approximation. The “Gram
determinant” was introduced by him in connection with his study of linear independence.

²Erhard Schmidt (1876–1959), a student of Hilbert, became a prominent member of the Berlin
School of Mathematics, where he founded the Institute of Applied Mathematics. He is considered
one of the originators of Functional Analysis, having contributed substantially to the theory of
Hilbert spaces. His work on linear and nonlinear integral equations is of lasting interest, as is his
contribution to linear algebraic systems of infinite dimension. He is also known for his proof of the
Jordan curve theorem. His procedure of orthogonalization was published in 1907 and today also
carries the name of Gram. It was known, however, already to Laplace.


2.1.3 Least Squares Error; Convergence


We have seen in Sect. 2.1.2 that if the class $\Phi = \Phi_n$ consists of $n$ functions $\pi_j$,
$j = 1, 2, \ldots, n$, that are linearly independent on the support of some measure $d\lambda$,
then the least squares problem for this measure,

$$\min_{\varphi \in \Phi_n} \|f - \varphi\|_{2,d\lambda} = \|f - \hat\varphi\|_{2,d\lambda}, \tag{2.26}$$

has a unique solution $\hat\varphi = \hat\varphi_n$ given by (2.22).

Fig. 2.2 Least squares approximation as orthogonal projection

There are many ways we can select
a basis $\pi_j$ in $\Phi_n$ and, therefore, many ways the solution $\hat\varphi_n$ can be represented.
Nevertheless, it is always one and the same function. The least squares error, the
quantity on the right-hand side of (2.26), therefore is independent of the choice of
basis functions (although the calculation of the least squares solution, as mentioned
previously, is not). In studying this error, we may thus assume, without restricting
generality, that the basis $\pi_j$ is an orthogonal system. (Every linearly independent
system can be orthogonalized by the Gram–Schmidt orthogonalization procedure;
cf. Sect. 2.1.2.) We then have (cf. (2.24))


$$\hat\varphi_n(t) = \sum_{j=1}^{n} \hat c_j \pi_j(t), \qquad \hat c_j = \frac{(\pi_j, f)}{(\pi_j, \pi_j)}. \tag{2.27}$$

We first note that the error $f - \hat\varphi_n$ is orthogonal to the space $\Phi_n$; that is,

$$(f - \hat\varphi_n, \varphi) = 0 \quad \text{for all } \varphi \in \Phi_n, \tag{2.28}$$

where the inner product is the one in (2.10). Since $\varphi$ is a linear combination of the
$\pi_k$, it suffices to show (2.28) for each $\varphi = \pi_k$, $k = 1, 2, \ldots, n$. Inserting $\hat\varphi_n$ from
(2.27) in the left-hand side of (2.28), and using orthogonality, we find indeed

$$(f - \hat\varphi_n, \pi_k) = \Big(f - \sum_{j=1}^{n} \hat c_j \pi_j,\ \pi_k\Big) = (f, \pi_k) - \hat c_k (\pi_k, \pi_k) = 0,$$

the last equation following from the formula for $\hat c_k$ in (2.27). The result (2.28) has
a simple geometric interpretation. If we picture functions as vectors, and the space
$\Phi_n$ as a plane, then for any $f$ that “sticks out” of the plane $\Phi_n$, the least squares
approximant $\hat\varphi_n$ is the orthogonal projection of $f$ onto $\Phi_n$; see Fig. 2.2.

In particular, choosing $\varphi = \hat\varphi_n$ in (2.28), we get

$$(f - \hat\varphi_n, \hat\varphi_n) = 0$$

and, therefore, since $f = (f - \hat\varphi_n) + \hat\varphi_n$, by the Theorem of Pythagoras (cf. (2.14))
and its generalization (cf. (2.16)),

$$\|f\|^2 = \|f - \hat\varphi_n\|^2 + \|\hat\varphi_n\|^2
= \|f - \hat\varphi_n\|^2 + \Big\|\sum_{j=1}^{n} \hat c_j \pi_j\Big\|^2
= \|f - \hat\varphi_n\|^2 + \sum_{j=1}^{n} |\hat c_j|^2 \|\pi_j\|^2.$$

Solving for the first term on the right-hand side, we get

$$\|f - \hat\varphi_n\| = \Big\{\|f\|^2 - \sum_{j=1}^{n} |\hat c_j|^2 \|\pi_j\|^2\Big\}^{1/2}, \qquad \hat c_j = \frac{(\pi_j, f)}{(\pi_j, \pi_j)}. \tag{2.29}$$


Note that the expression in braces must necessarily be nonnegative.

The formula (2.29) for the error is interesting theoretically, but of limited
practical use. Note, indeed, that as the error approaches the level of the machine
precision eps, computing the error from the right-hand side of (2.29) cannot produce
anything smaller than $\sqrt{\text{eps}}$ because of inevitable rounding errors committed during
the subtraction in the radicand. (They may even produce a negative result for the
radicand.) Using instead the definition,

$$\|f - \hat\varphi_n\| = \Big\{\int_{\mathbb{R}} [f(t) - \hat\varphi_n(t)]^2\,d\lambda(t)\Big\}^{1/2},$$

along, perhaps, with a suitable (positive) quadrature rule (cf. Chap. 3, Sect. 3.2),
is guaranteed to produce a nonnegative result that may potentially be as small as
$O(\text{eps})$.

If we are now given a sequence of linear spaces $\Phi_n$, $n = 1, 2, 3, \ldots$, as defined
in (2.2), then clearly

$$\|f - \hat\varphi_1\| \ge \|f - \hat\varphi_2\| \ge \|f - \hat\varphi_3\| \ge \cdots,$$

which follows not only from (2.29), but more directly from the fact that $\Phi_1 \subset \Phi_2 \subset
\Phi_3 \subset \cdots$. If there are infinitely many such spaces, then the sequence of $L_2$ errors,
being monotonically decreasing, must converge to a limit. Is this limit zero? If so, we
say that the least squares approximation process converges (in the mean) as $n \to \infty$.
It is obvious from (2.29) that a necessary and sufficient condition for this is

$$\sum_{j=1}^{\infty} |\hat c_j|^2 \|\pi_j\|^2 = \|f\|^2. \tag{2.30}$$

An equivalent way of stating convergence is as follows: given any $f$ with $\|f\| <
\infty$, that is, any $f$ in the $L_{2,d\lambda}$ space, and given any $\varepsilon > 0$, no matter how small, there
exists an integer $n = n_\varepsilon$ and a function $\varphi^* \in \Phi_n$ such that $\|f - \varphi^*\| \le \varepsilon$. A class
of spaces $\Phi_n$ having this property is said to be complete with respect to the norm
$\|\cdot\| = \|\cdot\|_{2,d\lambda}$. One therefore calls (2.30) also the completeness relation.

For a finite interval $[a,b]$, one can define completeness of $\{\Phi_n\}$ also for the
uniform norm $\|\cdot\| = \|\cdot\|_\infty$ on $[a,b]$. One then assumes $f \in C[a,b]$ and also
$\pi_j \in C[a,b]$ for all basis functions in all classes $\Phi_n$, and one calls $\{\Phi_n\}$ complete
in the norm $\|\cdot\|_\infty$ if for any $f \in C[a,b]$ and any $\varepsilon > 0$ there is an $n = n_\varepsilon$ and a
$\varphi^* \in \Phi_n$ such that $\|f - \varphi^*\|_\infty \le \varepsilon$. It is easy to see that completeness of $\{\Phi_n\}$ in
the norm $\|\cdot\|_\infty$ (on $[a,b]$) implies completeness of $\{\Phi_n\}$ in the $L_2$ norm $\|\cdot\|_{2,d\lambda}$,
where $\operatorname{supp} d\lambda = [a,b]$, and hence convergence of the least squares approximation
process. Indeed, let $\varepsilon > 0$ be arbitrary and let $n$ and $\varphi^* \in \Phi_n$ be such that

$$\|f - \varphi^*\|_\infty \le \frac{\varepsilon}{\left(\int_{\mathbb{R}} d\lambda(t)\right)^{1/2}}.$$

This is possible by assumption. Then

$$\|f - \varphi^*\|_{2,d\lambda} = \left(\int_{\mathbb{R}} [f(t) - \varphi^*(t)]^2\,d\lambda(t)\right)^{1/2}
\le \|f - \varphi^*\|_\infty \left(\int_{\mathbb{R}} d\lambda(t)\right)^{1/2}
\le \frac{\varepsilon}{\left(\int_{\mathbb{R}} d\lambda(t)\right)^{1/2}} \left(\int_{\mathbb{R}} d\lambda(t)\right)^{1/2} = \varepsilon,$$

as claimed.


Example: $\Phi_n = \mathbb{P}_{n-1}$.

Here completeness of $\{\Phi_n\}$ in the norm $\|\cdot\|_\infty$ (on a finite interval $[a,b]$) is
a consequence of Weierstrass’s Approximation Theorem. Thus, polynomial least
squares approximation on a finite interval always converges (in the mean).


2.1.4 Examples of Orthogonal Systems 


There are many orthogonal systems in use. The prototype of them all is the system
of trigonometric functions known from Fourier analysis. Other widely used systems
involve algebraic polynomials. We restrict ourselves here to these two particular
examples of orthogonal systems.

1. Trigonometric functions: $1, \cos t, \cos 2t, \cos 3t, \ldots, \sin t, \sin 2t, \sin 3t, \ldots$.
These are the basic harmonics; they are mutually orthogonal on the interval
$[0, 2\pi]$ with respect to the equally weighted measure on $[0, 2\pi]$,

$$d\lambda(t) = \begin{cases} dt & \text{on } [0, 2\pi], \\ 0 & \text{otherwise}. \end{cases} \tag{2.31}$$

We verify this for the sine functions: for $k, \ell = 1, 2, 3, \ldots$ we have

$$\int_0^{2\pi} \sin kt \cdot \sin \ell t\,dt = -\frac{1}{2}\int_0^{2\pi} [\cos(k+\ell)t - \cos(k-\ell)t]\,dt.$$

The right-hand side is equal to

$$-\frac{1}{2}\left[\frac{\sin(k+\ell)t}{k+\ell} - \frac{\sin(k-\ell)t}{k-\ell}\right]_0^{2\pi} = 0$$

when $k \ne \ell$, and equal to $\pi$ otherwise. Thus,

$$\int_0^{2\pi} \sin kt \cdot \sin \ell t\,dt = \begin{cases} 0 & \text{if } k \ne \ell, \\ \pi & \text{if } k = \ell, \end{cases} \qquad k, \ell = 1, 2, 3, \ldots. \tag{2.32}$$

Similarly, one shows that

$$\int_0^{2\pi} \cos kt \cdot \cos \ell t\,dt = \begin{cases} 0 & \text{if } k \ne \ell, \\ 2\pi & \text{if } k = \ell = 0, \\ \pi & \text{if } k = \ell > 0, \end{cases} \qquad k, \ell = 0, 1, 2, \ldots \tag{2.33}$$

and

$$\int_0^{2\pi} \sin kt \cdot \cos \ell t\,dt = 0, \qquad k = 1, 2, 3, \ldots, \quad \ell = 0, 1, 2, \ldots. \tag{2.34}$$

The theory of Fourier series is concerned with the expansion of a given $2\pi$-periodic
function in terms of these trigonometric functions,

$$f(t) = \sum_{k=0}^{\infty} a_k \cos kt + \sum_{k=1}^{\infty} b_k \sin kt. \tag{2.35}$$

Using (2.32)-(2.34), one formally obtains

$$a_0 = \frac{1}{2\pi}\int_0^{2\pi} f(t)\,dt, \qquad a_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\cos kt\,dt, \quad k = 1, 2, \ldots,$$

$$b_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\sin kt\,dt, \quad k = 1, 2, \ldots, \tag{2.36}$$

which are known as Fourier coefficients of $f$. They are precisely the coefficients
(2.24) for the system $\pi_j$ consisting of our trigonometric functions. By extension,
one therefore calls the coefficients $\hat c_j$ in (2.24), for any orthogonal system $\pi_j$, the
Fourier coefficients of $f$ relative to this system. In particular, we now recognize
the truncated Fourier series (the series on the right-hand side of (2.35) truncated
at $k = m$, with $a_k$, $b_k$ given by (2.36)) as the best $L_2$ approximation to $f$ from the
class of trigonometric polynomials of degree $\le m$ relative to the norm (cf. (2.31))

$$\|u\|_2 = \left(\int_0^{2\pi} |u(t)|^2\,dt\right)^{1/2}.$$
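The formulas (2.36) can be tried out numerically. In the Python sketch below, which is our illustration, the integrals are approximated by the $N$-point trapezoidal rule on $[0, 2\pi]$; the choice of quadrature rule, the function name `fourier_coefficients`, and the test function are assumptions. For a trigonometric polynomial of low degree and moderately large $N$, this rule reproduces the integrals up to rounding.

```python
import math

def fourier_coefficients(f, m, N=64):
    """Approximate a_0..a_m and b_1..b_m of (2.36) by the N-point trapezoidal rule."""
    h = 2.0 * math.pi / N
    t = [i * h for i in range(N)]                      # equally spaced points on [0, 2*pi)
    fv = [f(ti) for ti in t]
    a = [h * sum(fv) / (2.0 * math.pi)]                # a_0
    b = [0.0]                                          # placeholder so that b[k] has index k
    for k in range(1, m + 1):
        a.append(h * sum(fi * math.cos(k * ti) for fi, ti in zip(fv, t)) / math.pi)
        b.append(h * sum(fi * math.sin(k * ti) for fi, ti in zip(fv, t)) / math.pi)
    return a, b

f = lambda t: 1.0 + 2.0 * math.cos(t) + 3.0 * math.sin(2.0 * t)
a, b = fourier_coefficients(f, 3)
print(a)   # a_0, a_1 close to 1, 2; the others close to 0
print(b)   # b_2 close to 3; the others close to 0
```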


2. Orthogonal polynomials: given a measure $d\lambda$ as introduced in (2.3)-(2.5), we
know from the example immediately following (2.7) that any finite number of
consecutive powers $1, t, t^2, \ldots$ are linearly independent on $[a,b]$, if $\operatorname{supp} d\lambda =
[a,b]$, whereas the finite set $1, t, \ldots, t^{N-1}$ is linearly independent on $\operatorname{supp} d\lambda =
\{t_1, t_2, \ldots, t_N\}$. Since a linearly independent set can be orthogonalized by
Gram–Schmidt (cf. Sect. 2.1.2), any measure $d\lambda$ of the type considered generates
a unique set of (monic) polynomials $\pi_j(t) = \pi_j(t; d\lambda)$, $j = 0, 1, 2, \ldots$,
satisfying

$$\deg \pi_j = j, \quad j = 0, 1, 2, \ldots,$$

$$\int_{\mathbb{R}} \pi_k(t)\,\pi_\ell(t)\,d\lambda(t) = 0 \ \text{ if } k \ne \ell. \tag{2.37}$$

These are called orthogonal polynomials relative to the measure $d\lambda$. (We slightly
deviate from the notation in Sects. 2.1.2 and 2.1.3 by letting the index $j$ start
from zero.) The set $\pi_j$ is infinite if $\operatorname{supp} d\lambda = [a,b]$, and consists of exactly $N$
polynomials $\pi_0, \pi_1, \ldots, \pi_{N-1}$ if $\operatorname{supp} d\lambda = \{t_1, t_2, \ldots, t_N\}$. The latter are referred
to as discrete orthogonal polynomials.

It is an important fact that three consecutive orthogonal polynomials are linearly
related. Specifically, there are real constants $\alpha_k = \alpha_k(d\lambda)$ and positive constants
$\beta_k = \beta_k(d\lambda)$ (depending on the measure $d\lambda$) such that

$$\pi_{k+1}(t) = (t - \alpha_k)\,\pi_k(t) - \beta_k\,\pi_{k-1}(t), \quad k = 0, 1, 2, \ldots,$$
$$\pi_{-1}(t) = 0, \quad \pi_0(t) = 1. \tag{2.38}$$

(It is understood that (2.38) holds for all integers $k \ge 0$ if $\operatorname{supp} d\lambda = [a,b]$, and
only for $0 \le k < N - 1$ if $\operatorname{supp} d\lambda = \{t_1, t_2, \ldots, t_N\}$.)

To prove (2.38) and, at the same time, identify the coefficients $\alpha_k$, $\beta_k$, we note
that

$$\pi_{k+1}(t) - t\,\pi_k(t)$$

is a polynomial of degree $\le k$, since the leading terms cancel (the polynomials $\pi_j$
are assumed monic). Since an orthogonal system is linearly independent (cf. the
remark after (2.16)), we can express this polynomial as a linear combination of $\pi_0$,
$\pi_1, \ldots, \pi_k$. We choose to write this linear combination in the form:

$$\pi_{k+1}(t) - t\,\pi_k(t) = -\alpha_k \pi_k(t) - \beta_k \pi_{k-1}(t) + \sum_{j=0}^{k-2} \gamma_{k,j}\,\pi_j(t) \tag{2.39}$$

(with the understanding that empty sums are zero). Now multiply both sides
of (2.39) by $\pi_k$ in the sense of the inner product $(\cdot\,, \cdot)$ defined in (2.10). By
orthogonality, this gives $(-t\pi_k, \pi_k) = -\alpha_k (\pi_k, \pi_k)$; that is,

$$\alpha_k = \frac{(t\pi_k, \pi_k)}{(\pi_k, \pi_k)}, \quad k = 0, 1, 2, \ldots. \tag{2.40}$$

Similarly, forming the inner product of (2.39) with $\pi_{k-1}$ gives $(-t\pi_k, \pi_{k-1}) =
-\beta_k (\pi_{k-1}, \pi_{k-1})$. Since $(t\pi_k, \pi_{k-1}) = (\pi_k, t\pi_{k-1})$ and $t\pi_{k-1}$ differs from $\pi_k$ by
a polynomial of degree $< k$, we obtain by orthogonality $(t\pi_k, \pi_{k-1}) = (\pi_k, \pi_k)$;
hence

$$\beta_k = \frac{(\pi_k, \pi_k)}{(\pi_{k-1}, \pi_{k-1})}, \quad k = 1, 2, \ldots. \tag{2.41}$$

Finally, multiplication of (2.39) by $\pi_\ell$, $\ell < k - 1$, yields

$$\gamma_{k,\ell} = 0, \quad \ell = 0, 1, \ldots, k-2. \tag{2.42}$$

Solving (2.39) for $\pi_{k+1}$ then establishes (2.38), with $\alpha_k$, $\beta_k$ defined by (2.40)
and (2.41), respectively. Clearly, $\beta_k > 0$. By convention, $\beta_0 = \int_{\mathbb{R}} d\lambda(t) =
\int_{\mathbb{R}} \pi_0^2(t)\,d\lambda(t)$.

The recursion (2.38) provides us with a practical scheme of generating orthogonal
polynomials. Indeed, since $\pi_0 = 1$, we can compute $\alpha_0$ by (2.40) with $k = 0$.
This allows us to compute $\pi_1(t)$ for any $t$, using (2.38) with $k = 0$. Knowing $\pi_0$, $\pi_1$,
we can go back to (2.40) and (2.41) and compute, respectively, $\alpha_1$ and $\beta_1$. This gives
us access to $\pi_2$ via (2.38) with $k = 1$. Proceeding in this fashion, using alternately
(2.40), (2.41), and (2.38), we can generate as many orthogonal polynomials as are
desired. This procedure, called Stieltjes's procedure, is particularly well suited
for discrete orthogonal polynomials, since the inner product is then a finite sum,
$(u,v) = \sum_{i=1}^{N} w_i\,u(t_i)\,v(t_i)$ (cf. (2.5)), so that the computation of the $\alpha_k$, $\beta_k$ from
(2.40) and (2.41) is straightforward. In the continuous case, the computation of the
inner product requires integration, which complicates matters. Fortunately, for many
important special measures $d\lambda(t) = w(t)\,dt$, the recursion coefficients are explicitly
known (cf. Chap. 3, Table 3.1). In these cases, it is again straightforward to generate
the orthogonal polynomials by (2.38).
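For a discrete measure, Stieltjes's procedure can be sketched in a few lines. The following Python sketch is an illustration only, not code from the text: the function name `stieltjes` and the choice of representing each polynomial by its values at the support points are assumptions made here. It alternates (2.40), (2.41), and (2.38) exactly as described above.

```python
def stieltjes(nodes, weights, n):
    """Recursion coefficients alpha_0..alpha_{n-1}, beta_0..beta_{n-1} of (2.38)
    for the discrete measure with support points t_i and weights w_i.
    Each polynomial is stored through its values at the support points."""
    N = len(nodes)
    if n > N:
        raise ValueError("only N discrete orthogonal polynomials exist")
    pi_prev = [0.0] * N           # values of pi_{k-1}; pi_{-1} = 0
    pi_curr = [1.0] * N           # values of pi_k;     pi_0 = 1
    norm_prev = None
    alpha, beta = [], []
    for k in range(n):
        # (pi_k, pi_k) and (t pi_k, pi_k) as finite sums, cf. (2.5)
        norm_curr = sum(w * p * p for w, p in zip(weights, pi_curr))
        t_moment = sum(w * t * p * p for w, t, p in zip(weights, nodes, pi_curr))
        a_k = t_moment / norm_curr                               # (2.40)
        b_k = norm_curr if k == 0 else norm_curr / norm_prev     # (2.41); beta_0 = sum of weights
        alpha.append(a_k)
        beta.append(b_k)
        # three-term recurrence (2.38), evaluated at every support point
        pi_next = [(t - a_k) * pc - b_k * pp
                   for t, pc, pp in zip(nodes, pi_curr, pi_prev)]
        pi_prev, pi_curr, norm_prev = pi_curr, pi_next, norm_curr
    return alpha, beta


# Three equally weighted points, symmetric about the origin.
alpha, beta = stieltjes([-1.0, 0.0, 1.0], [1.0, 1.0, 1.0], 3)
print(alpha, beta)
```

Because the support in this small example is symmetric about the origin, all computed $\alpha_k$ come out as zero.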

The special case of symmetry (i.e., $d\lambda(t) = w(t)\,dt$ with $w(-t) = w(t)$ and
$\operatorname{supp}(d\lambda)$ symmetric with respect to the origin) deserves special mention. In this
case, defining $p_k(t) = (-1)^k \pi_k(-t)$, one obtains by a simple change of variables
that $(p_k, p_\ell) = (-1)^{k+\ell}(\pi_k, \pi_\ell) = 0$ if $k \ne \ell$. Since $p_k$ is monic, it follows by
uniqueness that $p_k(t) \equiv \pi_k(t)$; that is,

$$(-1)^k \pi_k(-t) \equiv \pi_k(t) \quad (d\lambda \text{ symmetric}). \tag{2.43}$$

Thus, if $k$ is even, then $\pi_k$ is an even polynomial, that is, a polynomial in $t^2$.
Likewise, when $k$ is odd, $\pi_k$ contains only odd powers of $t$. As a consequence,

$$\alpha_k = 0 \ \text{ for all } k \ge 0 \quad (d\lambda \text{ symmetric}), \tag{2.44}$$

which also follows from (2.40), since the numerator on the right-hand side of this
equation is an integral of an odd function over a symmetric set of points.


Example: Legendre³ polynomials.
We may introduce the monic Legendre polynomials by

$$\pi_k(t) = (-1)^k\,\frac{k!}{(2k)!}\,\frac{d^k}{dt^k}\,(1 - t^2)^k, \quad k = 0,1,2,\ldots, \tag{2.45}$$

which is known as the Rodrigues formula.
We first verify orthogonality on the interval $[-1,1]$ relative to the measure
$d\lambda(t) = dt$. For any $\ell$ with $0 \le \ell < k$, repeated integration by parts gives

$$\int_{-1}^{1} \frac{d^k}{dt^k}(1-t^2)^k \cdot t^\ell\,dt = \sum_{m=0}^{\ell} (-1)^m\,\ell(\ell-1)\cdots(\ell-m+1)\,t^{\ell-m}\,\frac{d^{k-m-1}}{dt^{k-m-1}}(1-t^2)^k\,\Bigg|_{-1}^{1} = 0,$$

the last equation since $0 \le k-m-1 < k$. Thus,

$$(\pi_k, p) = 0 \ \text{ for every } p \in \mathbb{P}_{k-1},$$


3 Adrien Marie Legendre (1752-1833) was a French mathematician active in Paris, best known 
not only for his treatise on elliptic integrals but also famous for his work in number theory and 
geometry. He is considered as the originator (in 1805) of the method of least squares, although 
Gauss had already used it in 1794, but published it only in 1809. 
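proving orthogonality. Writing (by symmetry)

$$\pi_k(t) = t^k + \mu_k t^{k-2} + \cdots, \quad k \ge 2,$$

and noting (again by symmetry) that the recurrence relation has the form

$$\pi_{k+1}(t) = t\,\pi_k(t) - \beta_k\,\pi_{k-1}(t),$$

we obtain

$$\beta_k = \frac{t\,\pi_k(t) - \pi_{k+1}(t)}{\pi_{k-1}(t)},$$

which is valid for all $t$. In particular, as $t \to \infty$,

$$\beta_k = \lim_{t\to\infty} \frac{t\,\pi_k(t) - \pi_{k+1}(t)}{\pi_{k-1}(t)} = \lim_{t\to\infty} \frac{(\mu_k - \mu_{k+1})\,t^{k-1} + \cdots}{t^{k-1} + \cdots} = \mu_k - \mu_{k+1}.$$

(If $k = 1$, set $\mu_1 = 0$.) From Rodrigues's formula, however, we find

$$\begin{aligned}
\pi_k(t) &= \frac{k!}{(2k)!}\,\frac{d^k}{dt^k}\left(t^{2k} - k\,t^{2k-2} + \cdots\right) \\
&= \frac{k!}{(2k)!}\left(2k(2k-1)\cdots(k+1)\,t^k - k\cdot(2k-2)(2k-3)\cdots(k-1)\,t^{k-2} + \cdots\right) \\
&= t^k - \frac{k(k-1)}{2(2k-1)}\,t^{k-2} + \cdots,
\end{aligned}$$

so that

$$\mu_k = -\frac{k(k-1)}{2(2k-1)}, \quad k \ge 2.$$

Therefore,

$$\beta_k = \mu_k - \mu_{k+1} = -\frac{k(k-1)}{2(2k-1)} + \frac{(k+1)k}{2(2k+1)} = \frac{k}{2}\,\frac{2k}{(2k+1)(2k-1)},$$

that is, since $\mu_1 = 0$,

$$\beta_k = \frac{1}{4 - k^{-2}}, \quad k \ge 1. \tag{2.46}$$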
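The result (2.46) can be checked in exact rational arithmetic. The Python sketch below is illustrative only; the helper names `monic_legendre` and `inner` are assumptions made here. It builds the monic Legendre polynomials from the recurrence (2.38) with $\alpha_k = 0$ and $\beta_k = 1/(4 - k^{-2}) = k^2/(4k^2 - 1)$, then confirms orthogonality on $[-1,1]$ and the ratio (2.41).

```python
from fractions import Fraction


def monic_legendre(n):
    """Coefficient lists (lowest power first) of pi_0, ..., pi_n, generated by
    (2.38) with alpha_k = 0, cf. (2.44), and beta_k from (2.46)."""
    polys = [[Fraction(1)]]
    prev = [Fraction(0)]                       # pi_{-1} = 0
    for k in range(n):
        curr = polys[-1]
        beta = Fraction(0) if k == 0 else Fraction(k * k, 4 * k * k - 1)
        nxt = [Fraction(0)] + curr             # t * pi_k(t)
        for i, c in enumerate(prev):
            nxt[i] -= beta * c                 # - beta_k * pi_{k-1}(t)
        polys.append(nxt)
        prev = curr
    return polys


def inner(p, q):
    """Exact integral of p(t) q(t) over [-1, 1] for d lambda(t) = dt."""
    total = Fraction(0)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if (i + j) % 2 == 0:               # odd powers integrate to zero
                total += a * b * Fraction(2, i + j + 1)
    return total


polys = monic_legendre(5)
print(polys[3])                                # coefficients of t^3 - (3/5) t
print(inner(polys[2], polys[4]))               # vanishes by orthogonality
```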
We conclude with two remarks concerning discrete measures $d\lambda$ with $\operatorname{supp} d\lambda =
\{t_1, t_2, \ldots, t_N\}$. As before, the $L_2$ errors decrease monotonically, but the last one is
now zero, since there is a polynomial of degree $\le N-1$ that interpolates $f$ at the
$N$ points $t_1, t_2, \ldots, t_N$ (cf. Sect. 2.1.2). Thus,

$$\|f - \hat\varphi_0\| \ge \|f - \hat\varphi_1\| \ge \cdots \ge \|f - \hat\varphi_{N-1}\| = 0, \tag{2.47}$$

where $\hat\varphi_n$ is the $L_2$ approximant of degree $\le n$,

$$\hat\varphi_n(t) = \sum_{j=0}^{n} \hat c_j\,\pi_j(t; d\lambda), \qquad \hat c_j = \frac{(\pi_j, f)}{(\pi_j, \pi_j)}. \tag{2.48}$$

We see that the polynomial $\hat\varphi_{N-1}$ solves the interpolation problem for $\mathbb{P}_{N-1}$. Using
(2.48) with $n = N-1$ to obtain the interpolation polynomial, however, is a
roundabout way of solving the interpolation problem. We learn of more direct ways
in the next section.


2.2 Polynomial Interpolation
We now wish to approximate functions by matching their values at given points.
Using polynomials as approximants gives rise to the following problem: given $n+1$
distinct points $x_0, x_1, \ldots, x_n$ and values $f_i = f(x_i)$ of some function $f$ at these
points, find a polynomial $p \in \mathbb{P}_n$ such that

$$p(x_i) = f_i, \quad i = 0,1,2,\ldots,n.$$

Since we have to satisfy $n+1$ conditions, and have at our disposal $n+1$ degrees of
freedom (the coefficients of $p$), we expect the problem to have a unique solution.
Other questions of interest, in addition to existence and uniqueness, are different
ways of representing and computing the polynomial $p$, what can be said about the
error $e(x) = f(x) - p(x)$ when $x \ne x_i$, $i = 0,1,\ldots,n$, and the quality of
approximation $f(x) \approx p(x)$ when the number of points, and hence the degree of
$p$, is allowed to increase indefinitely. Although these questions are not of the utmost
interest in themselves, the results discussed here are widely used in the development
of approximate methods for more important practical tasks such as solving initial
and boundary value problems for ordinary and partial differential equations. It is in
view of these and other applications that we study polynomial interpolation.

The simplest example is linear interpolation, that is, the case $n = 1$. Here, it is
obvious from Fig. 2.3 that the interpolation problem has a unique solution. It is also
clear that the error $e(x)$ can be as large as one likes (or dislikes) if nothing is known
about $f$ other than its two values at $x_0$ and $x_1$.

One way of writing down the linear interpolant $p$ is as a weighted average of $f_0$
and $f_1$ (already taught in high school),

$$p(x) = f_0\,\frac{x - x_1}{x_0 - x_1} + f_1\,\frac{x - x_0}{x_1 - x_0}.$$

This is the way Lagrange expressed $p$ in the general case (cf. Sect. 2.1.2). However,
we can write $p$ also in Taylor's form, noting that its derivative at $x_0$ is equal to the
"difference quotient,"

$$p(x) = f_0 + \frac{f_1 - f_0}{x_1 - x_0}\,(x - x_0).$$

Fig. 2.3. Linear interpolation

This indeed is a prototype of Newton's form of the interpolation polynomial
(cf. Sect. 2.2.6).

Interpolating to function values is referred to as Lagrange interpolation. More 
generally, we may wish to interpolate to function and consecutive derivative values 
of some function. This is called Hermite interpolation. It turns out that the latter can 
be solved as a limit case of the former (cf. Sect. 2.2.7). 


2.2.1 Lagrange Interpolation Formula: Interpolation Operator 


We prove the existence of the interpolation polynomial by simply writing it down.
It is clear, indeed, that

$$\ell_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j}, \quad i = 0,1,\ldots,n, \tag{2.49}$$

is a polynomial of degree $n$ that interpolates to 1 at $x = x_i$ and to 0 at all the other
points. Multiplying it by $f_i$ produces the correct value at $x_i$, and then adding up the
resulting polynomials,

$$p(x) = \sum_{i=0}^{n} f_i\,\ell_i(x),$$

produces a polynomial, still of degree $\le n$, that has the desired interpolation
properties. To prove this formally, note that

$$\ell_i(x_k) = \delta_{ik} = \begin{cases} 1 & \text{if } i = k, \\ 0 & \text{if } i \ne k, \end{cases} \qquad i,k = 0,1,\ldots,n. \tag{2.50}$$

Therefore,

$$p(x_k) = \sum_{i=0}^{n} f_i\,\ell_i(x_k) = \sum_{i=0}^{n} f_i\,\delta_{ik} = f_k, \quad k = 0,1,\ldots,n.$$

This establishes the existence of the interpolation polynomial. To prove uniqueness,
assume that there are two polynomials of degree $\le n$, say, $p$ and $p^*$, both
interpolating to $f$ at $x_i$, $i = 0,1,\ldots,n$. Then

$$d(x) = p(x) - p^*(x)$$

is a polynomial of degree $\le n$ that satisfies

$$d(x_i) = f_i - f_i = 0, \quad i = 0,1,\ldots,n.$$

In other words, $d$ has $n+1$ distinct zeros $x_i$. There is only one polynomial in $\mathbb{P}_n$
with that many zeros, namely, $d(x) \equiv 0$. Therefore, $p^*(x) \equiv p(x)$.
We denote the unique polynomial $p \in \mathbb{P}_n$ interpolating $f$ at the (distinct) points
$x_0, x_1, \ldots, x_n$ by

$$p_n(f; x_0, x_1, \ldots, x_n; x) = p_n(f; x), \tag{2.51}$$

where we use the long form on the left-hand side if we want to place in evidence the
points at which interpolation takes place, and the short form on the right-hand side
if the choice of these points is clear from the context. We thus have what is called
the Lagrange⁴ interpolation formula

$$p_n(f; x) = \sum_{i=0}^{n} f(x_i)\,\ell_i(x), \tag{2.52}$$

with the $\ell_i(x)$, the elementary Lagrange interpolation polynomials, defined in
(2.49).
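Formulas (2.49) and (2.52) translate directly into code. The following Python sketch is a plain transcription for illustration, with hypothetical function names `lagrange_basis` and `lagrange_interpolate`; it makes no attempt at efficiency or at numerical robustness for large $n$.

```python
def lagrange_basis(nodes, i, x):
    """Elementary Lagrange polynomial l_i(x) of (2.49) for distinct nodes."""
    xi = nodes[i]
    value = 1.0
    for j, xj in enumerate(nodes):
        if j != i:
            value *= (x - xj) / (xi - xj)
    return value


def lagrange_interpolate(nodes, values, x):
    """Value at x of p_n(f; x) as given by the Lagrange formula (2.52)."""
    return sum(f_i * lagrange_basis(nodes, i, x) for i, f_i in enumerate(values))


# Data sampled from f(x) = x^2 + 1, a polynomial of degree 2.
nodes = [0.0, 1.0, 2.0]
values = [1.0, 2.0, 5.0]
print(lagrange_interpolate(nodes, values, 1.5))
```

Since the data come from a polynomial of degree 2 and three nodes are used, uniqueness implies that the interpolant reproduces $f$ up to rounding.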


4 Joseph Louis Lagrange (1736-1813), born in Turin, became, through correspondence with Euler,
his protégé. In 1766 he indeed succeeded Euler in Berlin. He returned to Paris in 1787. Clairaut 
wrote of the young Lagrange: “ ... a young man, no less remarkable for his talents than for 
his modesty; his temperament is mild and melancholic; he knows no other pleasure than study.” 
Lagrange made fundamental contributions to the calculus of variations and to number theory, and 
worked also on many problems in analysis. He is widely known for his representation of the 
remainder term in Taylor’s formula. The interpolation formula appeared in 1794. His Mécanique 
Analytique, published in 1788, made him one of the founders of analytic mechanics. 
It is useful to look at Lagrange interpolation in terms of a (linear) operator $P_n$
from (say) the space of continuous functions to the space of polynomials $\mathbb{P}_n$,

$$P_n : C[a,b] \to \mathbb{P}_n, \qquad P_n f = p_n(f;\,\cdot\,). \tag{2.53}$$

The interval $[a,b]$ here is any interval containing all points $x_i$, $i = 0,1,\ldots,n$. The
operator $P_n$ has the following properties:

1. $P_n(\alpha f) = \alpha P_n f$, $\alpha \in \mathbb{R}$ (homogeneity);

2. $P_n(f + g) = P_n f + P_n g$ (additivity).

Combining 1 and 2 shows that $P_n$ is a linear operator,

$$P_n(\alpha f + \beta g) = \alpha P_n f + \beta P_n g, \quad \alpha, \beta \in \mathbb{R}.$$

3. $P_n f = f$ for all $f \in \mathbb{P}_n$.

The last property, an immediate consequence of uniqueness of the interpolation
polynomial, says that $P_n$ leaves polynomials of degree $\le n$ unchanged, and hence
is a projection operator.

A norm of the linear operator $P_n$ can be defined (similarly as for matrices,
cf. Chap. 1, (1.30)) by

$$\|P_n\| = \max_{f \in C[a,b]} \frac{\|P_n f\|}{\|f\|}, \tag{2.54}$$

where on the right-hand side one takes any convenient norm for functions. Taking
the $L_\infty$ norm (cf. Table 2.1), one obtains from Lagrange's formula (2.52)

$$\|p_n(f;\,\cdot\,)\|_\infty = \max_{a \le x \le b} \left| \sum_{i=0}^{n} f(x_i)\,\ell_i(x) \right| \le \|f\|_\infty \max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)|. \tag{2.55}$$

Indeed, equality holds for some continuous function $f$; cf. Ex. 30. Therefore,

$$\|P_n\|_\infty = \Lambda_n, \tag{2.56}$$

where

$$\Lambda_n = \|\lambda_n\|_\infty, \qquad \lambda_n(x) = \sum_{i=0}^{n} |\ell_i(x)|. \tag{2.57}$$
i=0 
The function $\lambda_n(x)$ and its maximum $\Lambda_n$ are called, respectively, the Lebesgue⁵
function and Lebesgue constant for Lagrange interpolation. They provide a first
estimate for the interpolation error: let $E_n(f)$ be the error of the best (uniform)
approximation of $f$ on $[a,b]$ by polynomials of degree $\le n$,

$$E_n(f) = \min_{p \in \mathbb{P}_n} \|f - p\|_\infty = \|f - \hat p_n\|_\infty, \tag{2.58}$$

where $\hat p_n$ is the $n$th-degree polynomial of best uniform approximation to $f$. Then,
using the basic properties 1-3 of $P_n$, in particular the projection property 3, and
(2.55) and (2.57), one finds

$$\|f - p_n(f;\,\cdot\,)\|_\infty = \|f - \hat p_n - p_n(f - \hat p_n;\,\cdot\,)\|_\infty \le \|f - \hat p_n\|_\infty + \Lambda_n \|f - \hat p_n\|_\infty;$$

that is,

$$\|f - p_n(f;\,\cdot\,)\|_\infty \le (1 + \Lambda_n)\,E_n(f). \tag{2.59}$$

Thus, the better $f$ can be approximated by polynomials of degree $\le n$, the smaller
the interpolation error. Unfortunately, $\Lambda_n$ is not uniformly bounded: no matter how
one chooses the nodes $x_i = x_i^{(n)}$, $i = 0,1,\ldots,n$, one can show that always $\Lambda_n >
O(\log n)$ as $n \to \infty$. It is not possible, therefore, to conclude from Weierstrass's
approximation theorem (i.e., from $E_n(f) \to 0$, $n \to \infty$) that Lagrange interpolation
converges uniformly on $[a,b]$ for any continuous function, not even for judiciously
selected nodes; indeed, one knows that it does not.
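The growth of the Lebesgue constant can be observed numerically. The Python sketch below estimates $\Lambda_n$ of (2.57) on $[-1,1]$ by sampling $\lambda_n(x)$ on a fine grid, which yields a lower bound for the true maximum rather than the maximum itself. The function names, the grid size, and the two node families being compared (equally spaced points and the zeros of a Chebyshev polynomial) are choices made here for illustration.

```python
import math


def lebesgue_function(nodes, x):
    """lambda_n(x) = sum_i |l_i(x)|, cf. (2.57)."""
    total = 0.0
    for i, xi in enumerate(nodes):
        term = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += abs(term)
    return total


def lebesgue_constant(nodes, a, b, samples=2001):
    """Grid estimate (a lower bound) of Lambda_n = max of lambda_n on [a, b]."""
    grid = (a + (b - a) * m / (samples - 1) for m in range(samples))
    return max(lebesgue_function(nodes, x) for x in grid)


def equispaced(n):
    return [-1.0 + 2.0 * i / n for i in range(n + 1)]


def chebyshev(n):
    # zeros of the Chebyshev polynomial of degree n + 1
    return [math.cos((2 * i + 1) * math.pi / (2 * n + 2)) for i in range(n + 1)]


for n in (5, 10, 20):
    print(n,
          lebesgue_constant(equispaced(n), -1.0, 1.0),
          lebesgue_constant(chebyshev(n), -1.0, 1.0))
```

In such an experiment the estimate for equally spaced nodes grows very quickly with $n$, whereas for the Chebyshev zeros it grows only slowly, in keeping with the logarithmic lower bound quoted above.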


2.2.2 Interpolation Error 


As noted earlier, we need to make some assumptions about the function $f$ in order
to be able to estimate the error of interpolation, $f(x) - p_n(f;x)$, for any $x \ne x_i$ in
$[a,b]$. In (2.59) we made an assumption in terms of how well $f$ can be approximated
on $[a,b]$ by polynomials of degree $\le n$. Now we make an assumption on the
magnitude of some appropriate derivative of $f$.


5 Henri Léon Lebesgue (1875-1941) was a French mathematician best known for his work on the
theory of real functions, notably the concepts of measure and integral that now bear his name. These
became fundamental in many areas of mathematics such as functional analysis, Fourier analysis,
and probability theory. He has also made interesting contributions to the calculus of variations, the
theory of dimension, and set theory.
It is not difficult to guess how the formula for the error should look: since the
error is zero at each $x_i$, $i = 0,1,\ldots,n$, we ought to see a factor of the form
$(x - x_0)(x - x_1)\cdots(x - x_n)$. On the other hand, by the projection property 3 in Sect. 2.2.1,
the error is also zero (even identically so) if $f \in \mathbb{P}_n$, which suggests another factor,
the $(n+1)$st derivative of $f$. But evaluated where? Certainly not at $x$, since
$f$ would then have to satisfy a differential equation. So let us say that $f^{(n+1)}$ is
evaluated at some point $\xi = \xi(x)$, which is unknown but must be expected to
depend on $x$. Now if we test the formula so far conjectured on the simplest nontrivial
polynomial, $f(x) = x^{n+1}$, we discover that a factor $1/(n+1)!$ is missing. So, our
final (educated) guess is the formula

$$f(x) - p_n(f;x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!} \prod_{i=0}^{n} (x - x_i), \quad x \in [a,b]. \tag{2.60}$$

Here $\xi(x)$ is some number in the open interval $(a,b)$, but otherwise unspecified,

$$a < \xi(x) < b. \tag{2.61}$$


The statement (2.60) and (2.61) is, in fact, correct if we assume that $f \in
C^{n+1}[a,b]$. An elegant proof of it, due to Cauchy,⁶ goes as follows. We can assume
$x \ne x_i$ for $i = 0,1,\ldots,n$, since otherwise (2.60) would be trivially true for any
$\xi(x)$. So, fix $x \in [a,b]$ in this manner, and define a function $F$ of the new variable
$t$ as follows:

$$F(t) = f(t) - p_n(f;t) - \frac{f(x) - p_n(f;x)}{\prod_{i=0}^{n}(x - x_i)}\,\prod_{i=0}^{n}(t - x_i). \tag{2.62}$$

Clearly, $F \in C^{n+1}[a,b]$. Furthermore,

$$F(x_i) = 0, \quad i = 0,1,\ldots,n; \qquad F(x) = 0.$$

Thus, $F$ has $n+2$ distinct zeros in $[a,b]$. Applying repeatedly Rolle's Theorem,
we conclude that


6 Augustin Louis Cauchy (1789-1857), active in Paris, is truly the father of modern analysis. He
provided a firm foundation for analysis by basing it on a rigorous concept of limit. He is also 
the creator of complex analysis, of which “Cauchy’s formula” (cf. (2.70)) is a centerpiece. In 
addition, Cauchy’s name is attached to pioneering contributions to the theory of ordinary and partial 
differential equations, in particular, regarding questions of existence and uniqueness. As with 
many great mathematicians of the eighteenth and nineteenth centuries, his work also encompasses 
geometry, algebra, number theory, and mechanics, as well as theoretical physics. 
$F'$ has at least $n+1$ distinct zeros in $(a,b)$,
$F''$ has at least $n$ distinct zeros in $(a,b)$,
$F'''$ has at least $n-1$ distinct zeros in $(a,b)$,
...
$F^{(n+1)}$ has at least 1 zero in $(a,b)$,

since $F^{(n+1)}$ is still continuous on $[a,b]$. Denote by $\xi(x)$ a zero of $F^{(n+1)}$ whose
existence we just established. It certainly satisfies (2.61) and, of course, will depend
on $x$. Now differentiating $F$ in (2.62) $n+1$ times with respect to $t$, and then setting
$t = \xi(x)$, we get

$$0 = f^{(n+1)}(\xi(x)) - \frac{f(x) - p_n(f;x)}{\prod_{i=0}^{n}(x - x_i)} \cdot (n+1)!,$$

which, when solved for $f(x) - p_n(f;x)$, gives precisely (2.60). Actually, what we
have shown is that $\xi(x)$ is contained in the span of $x_0, x_1, \ldots, x_n, x$, that is, in the
interior of the smallest closed interval containing $x_0, x_1, \ldots, x_n$ and $x$.
Examples. 1. Linear interpolation ($n = 1$). Assume that $x_0 \le x \le x_1$; that is,
$[a,b] = [x_0, x_1]$, and let $h = x_1 - x_0$. Then by (2.60) and (2.61),

$$f(x) - p_1(f;x) = (x - x_0)(x - x_1)\,\frac{f''(\xi)}{2}, \quad x_0 < \xi < x_1,$$

and an easy computation gives

$$\|f - p_1(f;\,\cdot\,)\|_\infty \le \frac{M_2}{8}\,h^2, \quad M_2 = \|f''\|_\infty. \tag{2.63}$$

Here the $\infty$-norm refers to the interval $[x_0, x_1]$. Thus, on small intervals of length
$h$, the error for linear interpolation is $O(h^2)$.

2. Quadratic interpolation ($n = 2$) on equally spaced points $x_0$, $x_1 = x_0 + h$,
$x_2 = x_0 + 2h$. We now have, for $x \in [x_0, x_2]$,

$$f(x) - p_2(f;x) = (x - x_0)(x - x_1)(x - x_2)\,\frac{f'''(\xi)}{6}, \quad x_0 < \xi < x_2,$$

and (cf. Ex. 43(a))

$$\|f - p_2(f;\,\cdot\,)\|_\infty \le \frac{M_3}{9\sqrt{3}}\,h^3, \quad M_3 = \|f'''\|_\infty,$$

giving an error of $O(h^3)$.

Fig. 2.4 Interpolation error for eight equally spaced points


3. $n$th-degree interpolation on equally spaced points $x_i = x_0 + ih$, $i = 0,1,\ldots,n$.

When $h$ is small, and $x_0 \le x \le x_n$, then $\xi(x)$ in (2.60) is constrained to a
relatively small interval and $f^{(n+1)}(\xi(x))$ cannot vary a great deal. The behavior
of the error, therefore, is mainly determined by the product $\prod_{i=0}^{n}(x - x_i)$, the
graph of which, for $n = 7$, is shown in Fig. 2.4. We clearly have symmetry
with respect to the midpoint $(x_0 + x_n)/2$. It can also be shown that the relative
extrema decrease monotonically in modulus as one moves from the endpoints to
the center (cf. Ex. 29(c)).

It is evident that the oscillations become more violent as $n$ increases. In
particular, the curve is extremely steep at the endpoints, and takes off to $\infty$
rapidly as $x$ moves away from the interval $[x_0, x_n]$. Although it is true that the
curve representing the interpolation error is scaled by a factor of $O(h^{n+1})$, it
is also clear that one ought to interpolate near the center zone of the interval
$[x_0, x_n]$, if at all possible, and should avoid interpolation near the end zones, or
even extrapolation outside the interval. The highly oscillatory nature of the error
curve, when $n$ is large, also casts some legitimate doubts about convergence of
the interpolation process as $n \to \infty$. This is studied in the next section.
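The behavior of the node polynomial $\prod_{i=0}^{n}(x - x_i)$ described in Example 3 can be reproduced with a short computation. The Python sketch below takes $n = 7$, $x_0 = 0$, and $h = 1$ (values chosen here purely for illustration) and estimates by sampling the largest modulus of the product on each subinterval between consecutive nodes; the function names are hypothetical.

```python
def node_polynomial(nodes, x):
    """Value of the product (x - x_0)(x - x_1)...(x - x_n)."""
    prod = 1.0
    for xi in nodes:
        prod *= x - xi
    return prod


def subinterval_maxima(nodes, samples=1000):
    """Sampled maximum of |node polynomial| on each [x_i, x_{i+1}]."""
    maxima = []
    for left, right in zip(nodes, nodes[1:]):
        grid = (left + (right - left) * m / samples for m in range(samples + 1))
        maxima.append(max(abs(node_polynomial(nodes, x)) for x in grid))
    return maxima


nodes = [float(i) for i in range(8)]      # eight equally spaced points, h = 1
maxima = subinterval_maxima(nodes)
for (left, right), value in zip(zip(nodes, nodes[1:]), maxima):
    print(f"[{left:.0f}, {right:.0f}]  max |product| ~ {value:.2f}")
```

The printed maxima are symmetric about the midpoint and decrease from the end subintervals toward the center, which is the pattern shown in Fig. 2.4.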
2.2.3 Convergence


We first must define what we mean by "convergence." We assume that we are given
a triangular array of interpolation nodes $x_i = x_i^{(n)}$, exactly $n+1$ distinct nodes for
each $n = 0,1,2,\ldots$:

$$\begin{array}{lllll}
x_0^{(0)} \\
x_0^{(1)} & x_1^{(1)} \\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} \\
\ \vdots \\
x_0^{(n)} & x_1^{(n)} & x_2^{(n)} & \cdots & x_n^{(n)} \\
\ \vdots
\end{array} \tag{2.64}$$

We further assume that all nodes $x_i^{(n)}$ are contained in some finite interval $[a,b]$.
Then, for each $n$, we define

$$p_n(x) = p_n(f; x_0^{(n)}, x_1^{(n)}, \ldots, x_n^{(n)}; x), \quad x \in [a,b]. \tag{2.65}$$

We say that Lagrange interpolation based on the triangular array of nodes (2.64)
converges if

$$p_n(x) \to f(x) \ \text{ as } n \to \infty, \tag{2.66}$$

uniformly for $x \in [a,b]$.
Convergence clearly depends on the behavior of the $k$th derivative $f^{(k)}$ of $f$ as
$k \to \infty$. We assume that $f \in C^\infty[a,b]$, and that

$$|f^{(k)}(x)| \le M_k \ \text{ for } a \le x \le b, \quad k = 0,1,2,\ldots. \tag{2.67}$$

Since $|x - x_i^{(n)}| \le b - a$ whenever $x \in [a,b]$ and $x_i^{(n)} \in [a,b]$, we have

$$\left|(x - x_0^{(n)})(x - x_1^{(n)}) \cdots (x - x_n^{(n)})\right| \le (b-a)^{n+1}, \tag{2.68}$$

so that by (2.60)

$$|f(x) - p_n(x)| \le (b-a)^{n+1}\,\frac{M_{n+1}}{(n+1)!}, \quad x \in [a,b].$$

We therefore have convergence if

$$\lim_{k\to\infty} \frac{(b-a)^k}{k!}\,M_k = 0. \tag{2.69}$$
Fig. 2.5 The circular disk $C_r$


We now show that (2.69) is true if $f$ is analytic in a sufficiently large region in
the complex plane containing the interval $[a,b]$. Specifically, let $C_r$ be the circular
(closed) disk with center at the midpoint of $[a,b]$ and radius $r$, and assume, for the
time being, that $r > \frac{1}{2}(b-a)$, so that $[a,b] \subset C_r$. Assume $f$ analytic in $C_r$. Then
we can estimate the derivative in (2.67) by Cauchy's Formula,

$$f^{(k)}(x) = \frac{k!}{2\pi i} \oint_{\partial C_r} \frac{f(z)}{(z - x)^{k+1}}\,dz, \quad x \in [a,b]. \tag{2.70}$$

Noting that $|z - x| \ge r - \frac{1}{2}(b-a)$ (cf. Fig. 2.5), we obtain

$$|f^{(k)}(x)| \le \frac{k!}{2\pi}\,\frac{\max_{z \in \partial C_r} |f(z)|}{\left[r - \frac{1}{2}(b-a)\right]^{k+1}} \cdot 2\pi r.$$

Therefore, we can take for $M_k$ in (2.67)

$$M_k = \frac{r}{r - \frac{1}{2}(b-a)}\,\max_{z \in \partial C_r} |f(z)| \cdot \frac{k!}{\left[r - \frac{1}{2}(b-a)\right]^{k}}, \tag{2.71}$$

and (2.69) holds if

$$\left(\frac{b-a}{r - \frac{1}{2}(b-a)}\right)^{k} \to 0 \ \text{ as } k \to \infty,$$

that is, if $b - a < r - \frac{1}{2}(b-a)$, or, equivalently,

$$r > \tfrac{3}{2}(b-a). \tag{2.72}$$
We have shown that Lagrange interpolation converges (uniformly on $[a,b]$) for an
arbitrary triangular set of nodes (2.64) (all contained in $[a,b]$) if $f$ is analytic in
the circular disk $C_r$ centered at $(a+b)/2$ and having radius $r$ sufficiently large so
that (2.72) holds.

Since our derivation of this result used rather crude estimates (see, in particular,
(2.68)), the required domain of analyticity for $f$ that we found is certainly not sharp.
Using more refined methods, one can prove the following. Let $d\mu(t)$ be the "limit
distribution" of the interpolation nodes, that is, let

$$\int_a^x d\mu(t), \quad a < x \le b,$$

be the ratio of the number of nodes $x_i^{(n)}$ in $[a,x]$ to the total number, $n+1$, of
nodes, asymptotically as $n \to \infty$. (When the nodes are uniformly distributed over
the interval $[a,b]$, then $d\mu(t) = dt/(b-a)$.) A curve of constant logarithmic
potential is the locus of all complex $z \in \mathbb{C}$ such that

$$u(z) = \gamma, \qquad u(z) = \int_a^b \ln\frac{1}{|z - t|}\,d\mu(t),$$

where $\gamma$ is a constant. For large negative $\gamma$, these curves look like circles with large
radii and center at $(a+b)/2$. As $\gamma$ increases, the curves "shrink" toward the interval
$[a,b]$. Let

$$\Gamma = \sup \gamma,$$

where the supremum is taken over all curves $u(z) = \gamma$ containing $[a,b]$ in their
interior. The important domain (replacing $C_r$) is then the domain

$$C_\Gamma = \{z \in \mathbb{C} : u(z) \ge \Gamma\}, \tag{2.73}$$

in the sense that if $f$ is analytic in any domain $C$ containing $C_\Gamma$ in its interior (no
matter how closely $C$ covers $C_\Gamma$), then

$$|f(z) - p_n(f;z)| \to 0 \ \text{ as } n \to \infty \tag{2.74}$$

uniformly for $z \in C_\Gamma$.


Examples. 1. Equally distributed nodes: $d\mu(t) = dt/(b-a)$, $a \le t \le b$. In this
case, $C_\Gamma$ is a lens-shaped domain with tips at $a$ and $b$, as shown in Fig. 2.6. Thus,
we have uniform convergence in $C_\Gamma$ (not just on $[a,b]$, as before) provided $f$ is
analytic in a region slightly larger than $C_\Gamma$.

2. Arc sine distribution on $[-1,1]$: $d\mu(t) = \dfrac{1}{\pi}\,\dfrac{dt}{\sqrt{1-t^2}}$. Here the nodes are more
densely distributed near the endpoints of the interval $[-1,1]$. It turns out that in
this case $C_\Gamma = [-1,1]$, so that Lagrange interpolation converges uniformly on
$[-1,1]$ if $f$ is "analytic on $[-1,1]$," that is, analytic in any region, no matter how
thin, that contains the interval $[-1,1]$ in its interior.

Fig. 2.6 The domain $C_\Gamma$ for uniformly distributed nodes

3. Runge's⁷ example:

$$f(x) = \frac{1}{1 + x^2}, \quad -5 \le x \le 5,$$
$$x_k^{(n)} = -5 + k\,\frac{10}{n}, \quad k = 0,1,2,\ldots,n. \tag{2.75}$$

Here the nodes are equally spaced, hence asymptotically equally distributed.
Note that $f(z)$ has poles at $z = \pm i$. These poles lie definitely inside the region
$C_\Gamma$ in Fig. 2.6 for the interval $[-5,5]$, so that $f$ is not analytic in $C_\Gamma$. For this
reason, we can no longer expect convergence on the whole interval $[-5,5]$. It has
been shown, indeed, that

$$\lim_{n\to\infty} |f(x) - p_n(f;x)| = \begin{cases} 0 & \text{if } |x| < 3.633\ldots, \\ \infty & \text{if } |x| > 3.633\ldots. \end{cases} \tag{2.76}$$

We have convergence in the central zone of the interval $[-5,5]$, but divergence in
the lateral zones. With Fig. 2.4 kept in mind, this is perhaps not all that surprising
(cf. MA 7(b)).


4. Bernstein's⁸ example:

$$f(x) = |x|, \quad -1 \le x \le 1,$$
$$x_k^{(n)} = -1 + \frac{2k}{n}, \quad k = 0,1,2,\ldots,n. \tag{2.77}$$


7 Carl David Tolmé Runge (1856-1927) was active in the famous Göttingen school of mathematics
and is one of the early pioneers of numerical mathematics. He is best known for the Runge-Kutta
formula in ordinary differential equations (cf. Chap. 5, Sect. 5.6.5), for which he provided the basic
idea. He also made notable contributions to approximation theory in the complex plane.


8 Sergei Natanovič Bernštein (1880-1968) made major contributions to polynomial approximation,
continuing in the tradition of his countryman Chebyshev. He is also known for his work on partial
differential equations and probability theory.
Here analyticity of $f$ is completely gone, $f$ being not even differentiable at
$x = 0$. Accordingly, one finds that

$$\lim_{n\to\infty} |f(x) - p_n(f;x)| = \infty \ \text{ for every } x \in [-1,1], \text{ except } x = -1, \ x = 0, \text{ and } x = 1. \tag{2.78}$$

The fact that $x = \pm 1$ are exceptional points is trivial, since they are interpolation
nodes, where the error is zero. The same is true for $x = 0$ when $n$ is even, but
not if $n$ is odd.

The failure of convergence in the last two examples can only in part be blamed 
on insufficient regularity of f. Another culprit is the equidistribution of the 
nodes. There are indeed better distributions, for example, the arc sine distribution 
of Example 2. An instance of the latter is discussed in the next section. 
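The contrast between the central and lateral zones in Runge's example (Example 3) is easy to observe numerically. The Python sketch below is illustrative only: it evaluates the interpolant through the Lagrange formula (2.52) at the equally spaced nodes (2.75), and the two sample points $x = 0.3$ and $x = 4.8$ are arbitrary choices made here, one on each side of the threshold in (2.76).

```python
def interpolate(nodes, values, x):
    """p_n(f; x) via the Lagrange formula (2.52)."""
    total = 0.0
    for i, xi in enumerate(nodes):
        term = values[i]
        for j, xj in enumerate(nodes):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def runge(x):
    return 1.0 / (1.0 + x * x)


def error_at(x, n):
    """|f(x) - p_n(f; x)| for the equally spaced nodes (2.75) on [-5, 5]."""
    nodes = [-5.0 + 10.0 * k / n for k in range(n + 1)]
    values = [runge(t) for t in nodes]
    return abs(runge(x) - interpolate(nodes, values, x))


for n in (10, 20, 40):
    print(n, error_at(0.3, n), error_at(4.8, n))
```

The error at the central point shrinks as $n$ grows, while the error near the endpoint increases, in line with (2.76).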

We add one more example, which involves complex nodes, and for which the 
preceding theory, therefore, no longer applies. We prove convergence directly. 

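5. Interpolation at the roots of unity (Fejér⁹): $z_k = \exp(k\,2\pi i/n)$, $k = 1,2,\ldots,n$.
We show that

$$p_{n-1}(f;z) \to f(z), \quad n \to \infty, \quad \text{for any } |z| < 1, \tag{2.79}$$

uniformly in any disk $|z| \le \rho < 1$, provided $f$ is analytic in $|z| < 1$ and
continuous on $|z| \le 1$.
We have

$$\omega_n(z) := \prod_{k=1}^{n} (z - z_k) = z^n - 1, \qquad \omega_n'(z_k) = n z_k^{n-1} = \frac{n}{z_k},$$

so that the elementary Lagrange polynomials are

$$\ell_k(z) = \frac{\omega_n(z)}{\omega_n'(z_k)(z - z_k)} = \frac{z^n - 1}{(z - z_k)\,n/z_k} = \frac{z_k}{(z_k - z)\,n} + z^n\,\frac{z_k}{(z - z_k)\,n}.$$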

9 Leopold Fejér (1880-1959) was a leading Hungarian mathematician of the twentieth century.
Interestingly, Fejér had great difficulties in mathematics at the elementary and lower secondary 
school level, and even required private tutoring. It was an inspiring teacher in the upper-level 
secondary school who awoke Fejér’s interest and passion for mathematics. He went on to discover 
— still a university student — an important result on the summability of Fourier series, which made 
him famous overnight. He continued to make further contributions to the theory of Fourier series, 
but also occupied himself with problems of approximation and interpolation in the real as well 
as complex domain. He in turn was an inspiring teacher to the next generation of Hungarian 
mathematicians. 
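Therefore,

$$p_{n-1}(f;z) = \sum_{k=1}^{n} \frac{f(z_k)}{z_k - z}\,\frac{z_k}{n} + z^n \sum_{k=1}^{n} \frac{f(z_k)}{z - z_k}\,\frac{z_k}{n}. \tag{2.80}$$

We interpret the first sum as a Riemann sum of an integral extended over the unit
circle:

$$\sum_{k=1}^{n} \frac{f(z_k)}{z_k - z}\,\frac{z_k}{n} = \frac{1}{2\pi i} \sum_{k=1}^{n} \frac{f(e^{ik2\pi/n})}{e^{ik2\pi/n} - z}\; i\,e^{ik2\pi/n} \cdot \frac{2\pi}{n}
\;\to\; \frac{1}{2\pi i} \int_0^{2\pi} \frac{f(e^{i\theta})}{e^{i\theta} - z}\; i\,e^{i\theta}\,d\theta = \frac{1}{2\pi i} \oint_{|\zeta| = 1} \frac{f(\zeta)\,d\zeta}{\zeta - z} \quad \text{as } n \to \infty.$$

The last expression, by Cauchy's Formula, however, is precisely $f(z)$. The
second term in (2.80), being just $-z^n$ times the first, converges to zero, uniformly
in $|z| \le \rho < 1$.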
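The convergence (2.79) can be watched on a concrete function. The Python sketch below evaluates $p_{n-1}(f;z)$ directly from the elementary Lagrange polynomials $\ell_k(z) = (z^n - 1)\,z_k/(n\,(z - z_k))$ derived above. The test function $f(z) = 1/(2 - z)$, which is analytic in $|z| < 2$, and the evaluation point are assumptions chosen here for illustration, as is the function name.

```python
import cmath


def interpolate_roots_of_unity(f, n, z):
    """p_{n-1}(f; z) for the nodes z_k = exp(2 pi i k / n), k = 1, ..., n,
    using l_k(z) = (z^n - 1) z_k / (n (z - z_k)); z must not be a node."""
    total = 0j
    for k in range(1, n + 1):
        zk = cmath.exp(2j * cmath.pi * k / n)
        total += f(zk) * (z**n - 1) * zk / (n * (z - zk))
    return total


def f(z):
    return 1.0 / (2.0 - z)


z = 0.5 + 0.25j                       # a point inside the unit disk
for n in (4, 8, 16, 32):
    print(n, abs(f(z) - interpolate_roots_of_unity(f, n, z)))
```

The printed errors decrease rapidly with $n$, as expected for a function analytic well beyond the closed unit disk.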


2.2.4 Chebyshev Polynomials and Nodes 


The choice of nodes, as we saw in the previous section, distinctly influences
the convergence character of the interpolation process. We now discuss a choice
of points, the Chebyshev points, which leads to very favorable convergence
properties. These points are useful, not only for interpolation, but also for other
purposes (integration, collocation, etc.). We consider them on the canonical interval
$[-1,1]$, but they can be defined on any finite interval $[a,b]$ by means of a linear
transformation of variables that maps $[-1,1]$ onto $[a,b]$.

We begin with developing the Chebyshev polynomials. They arise from the fact
that the cosine of a multiple argument is a polynomial in the cosine of the simple
argument; more precisely,

$$\cos n\theta = T_n(\cos\theta), \quad T_n \in \mathbb{P}_n. \tag{2.81}$$

This is a consequence of the well-known trigonometric identity

$$\cos(k+1)\theta + \cos(k-1)\theta = 2\cos\theta\,\cos k\theta,$$

which, when solved for the first term, gives

$$\cos(k+1)\theta = 2\cos\theta\,\cos k\theta - \cos(k-1)\theta. \tag{2.82}$$

Therefore, if $\cos m\theta$ is a polynomial of degree $m$ in $\cos\theta$ for all $m \le k$, then the
same is true for $m = k+1$. Mathematical induction then proves (2.81). At the same
time, it follows from (2.81) and (2.82), if we set $\cos\theta = x$, that
$$T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x), \quad k = 1,2,3,\ldots, \qquad T_0(x) = 1, \quad T_1(x) = x. \tag{2.83}$$

The polynomials $T_n$ so defined are called the Chebyshev polynomials (of the first
kind). Thus, for example,

$$T_2(x) = 2x^2 - 1,$$
$$T_3(x) = 4x^3 - 3x,$$
$$T_4(x) = 8x^4 - 8x^2 + 1,$$

and so on.

Clearly, these polynomials are defined not only for $x$ in $[-1,1]$, but also for
arbitrary real or complex $x$. It is only that on the interval $[-1,1]$ they satisfy the
identity (2.81) (where $\theta$ is real).

It is evident from (2.83) that the leading coefficient of $T_n$ is $2^{n-1}$ (if $n \ge 1$); the
monic Chebyshev polynomial of degree $n$, therefore, is

$$\mathring{T}_n(x) = \frac{1}{2^{n-1}}\,T_n(x), \quad n \ge 1; \qquad \mathring{T}_0 = T_0. \tag{2.84}$$

The basic identity (2.81) allows us to immediately obtain the zeros $x_k = x_k^{(n)}$ of
$T_n$: indeed, $\cos n\theta = 0$ if $n\theta = (2k-1)\pi/2$, so that

$$x_k^{(n)} = \cos\theta_k^{(n)}, \qquad \theta_k^{(n)} = \frac{2k-1}{2n}\,\pi, \quad k = 1,2,\ldots,n. \tag{2.85}$$

All zeros of $T_n$ are thus real, distinct, and contained in the open interval $(-1,1)$. They
are the projections onto the real line of equally spaced points on the unit circle;
cf. Fig. 2.7 for the case $n = 4$.

In terms of the zeros $x_k^{(n)}$ of $T_n$, we can write the monic polynomial in factored
form as

$$\mathring{T}_n(x) = \prod_{k=1}^{n} \left(x - x_k^{(n)}\right). \tag{2.86}$$

As we let $\theta$ increase from 0 to $\pi$, hence $x = \cos\theta$ decrease from $+1$ to $-1$,
(2.81) shows that $T_n(x)$ oscillates between $+1$ and $-1$, attaining these extreme
values at

$$y_k^{(n)} = \cos\eta_k^{(n)}, \qquad \eta_k^{(n)} = \frac{k}{n}\,\pi, \quad k = 0,1,2,\ldots,n. \tag{2.87}$$
Fig. 2.7 The Chebyshev polynomial $y = T_4(x)$


In summary, then,

$$T_n\!\left(x_k^{(n)}\right) = 0 \ \text{ for } x_k^{(n)} = \cos\frac{2k-1}{2n}\,\pi, \quad k = 1,2,\ldots,n; \tag{2.88}$$

$$T_n\!\left(y_k^{(n)}\right) = (-1)^k \ \text{ for } y_k^{(n)} = \cos\frac{k}{n}\,\pi, \quad k = 0,1,2,\ldots,n. \tag{2.89}$$
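The recurrence (2.83) and the explicit zeros and extrema in (2.88) and (2.89) are simple to check numerically. The following Python sketch is an illustration with hypothetical function names; it evaluates $T_n(x)$ by the three-term recurrence and tests the computed values at the points (2.85) and (2.87).

```python
import math


def chebyshev_T(n, x):
    """T_n(x) evaluated by the three-term recurrence (2.83)."""
    t_prev, t_curr = 1.0, x              # T_0(x), T_1(x)
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr


def chebyshev_zeros(n):
    """Zeros x_k^(n), k = 1, ..., n, of T_n, cf. (2.85)."""
    return [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]


def chebyshev_extrema(n):
    """Extreme points y_k^(n), k = 0, ..., n, of T_n on [-1, 1], cf. (2.87)."""
    return [math.cos(k * math.pi / n) for k in range(n + 1)]


n = 6
print([round(chebyshev_T(n, x), 12) for x in chebyshev_zeros(n)])     # zeros, up to rounding
print([round(chebyshev_T(n, y), 12) for y in chebyshev_extrema(n)])   # alternating +1, -1
```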


Chebyshev polynomials owe their importance and usefulness to the following
theorem, due to Chebyshev.¹⁰


Theorem 2.2.1. For an arbitrary monic polynomial $\mathring{p}_n$ of degree $n$, there holds

$$\max_{-1 \le x \le 1} |\mathring{p}_n(x)| \ge \max_{-1 \le x \le 1} |\mathring{T}_n(x)| = \frac{1}{2^{n-1}}, \quad n \ge 1, \tag{2.90}$$

where $\mathring{T}_n$ is the monic Chebyshev polynomial (2.84) of degree $n$.


10 Pafnuti Levovich Chebyshev (1821-1894) was the most prominent member of the St. Petersburg
school of mathematics. He made pioneering contributions to number theory, probability theory, 
and approximation theory. He is regarded as the founder of constructive function theory, but also 
worked in mechanics, notably the theory of mechanisms, and in ballistics. 
Proof (by contradiction). Assume, contrary to (2.90), that

$$\max_{-1 \le x \le 1} |\mathring{p}_n(x)| < \frac{1}{2^{n-1}}. \tag{2.91}$$

Then the polynomial $d_n(x) = \mathring{T}_n(x) - \mathring{p}_n(x)$ (a polynomial of degree $\le n-1$)
satisfies

$$d_n\!\left(y_0^{(n)}\right) > 0, \quad d_n\!\left(y_1^{(n)}\right) < 0, \quad d_n\!\left(y_2^{(n)}\right) > 0, \ \ldots, \ (-1)^n d_n\!\left(y_n^{(n)}\right) > 0. \tag{2.92}$$

Thus $d_n$ changes sign at least $n$ times, and hence has at least $n$ distinct real zeros.
But having degree $\le n-1$, it must vanish identically, $d_n(x) \equiv 0$. This contradicts
(2.92); thus (2.91) cannot be true. □


The result (2.90) can be given the following interesting interpretation: the best
uniform approximation (on the interval $[-1,1]$) to $f(x) = x^n$ from polynomials in
$\mathbb{P}_{n-1}$ is given by $x^n - \mathring{T}_n(x)$, that is, by the aggregate of terms of degree $\le n-1$ in $\mathring{T}_n$
taken with the minus sign. From the theory of uniform polynomial approximation it
is known that the best approximant is unique. Therefore, equality in (2.90) can only
hold if $\mathring{p}_n = \mathring{T}_n$.

What is the significance of Chebyshev polynomials for interpolation? Recall
(cf. (2.60)) that the interpolation error (on $[-1,1]$, for a function $f \in C^{n+1}[-1,1]$),
is given by

$$f(x) - p_n(f;x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!} \prod_{i=0}^{n} (x - x_i), \quad x \in [-1,1]. \tag{2.93}$$

The first factor is essentially independent of the choice of the nodes $x_i$. It is true
that $\xi(x)$ does depend on the $x_i$, but we usually estimate $f^{(n+1)}$ by $\|f^{(n+1)}\|_\infty$,
which removes this dependence. On the other hand, the product in the second factor,
including its norm

$$\left\| \prod_{i=0}^{n} (\,\cdot - x_i) \right\|_\infty, \tag{2.94}$$

depends strongly on the $x_i$. It makes sense, therefore, to try to minimize (2.94) over
all $x_i \in [-1,1]$. Since the product in (2.94) is a monic polynomial of degree $n+1$, it
follows from Theorem 2.2.1 that the optimal nodes $x_i = \hat{x}_i^{(n)}$ in (2.93) are precisely
the zeros of $T_{n+1}$; that is,

$$\hat{x}_i^{(n)} = \cos\frac{2i+1}{2n+2}\,\pi, \quad i = 0,1,2,\ldots,n. \tag{2.95}$$
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For these nodes, we then have (cf. (2.90)) 


(n+1) iy 1 
IFO) = Pn(F3 loo < St (2.96) 


One ought to compare the last factor in (2.96) with the much cruder bound given in
(2.68), which, in the case of the interval $[-1,1]$, is $2^{n+1}$.

Since by (2.93) the error curve $y = f - p_n(f;\,\cdot)$ for Chebyshev points (2.95)
is essentially equilibrated (modulo the variation in the factor $f^{(n+1)}$), and thus
free of the violent oscillations we saw for equally spaced points, we would expect
more favorable convergence properties for the triangular array (2.64) consisting of
Chebyshev nodes. Indeed, one can prove, for example, that

$$p_n(f;\, \hat{x}_0^{(n)}, \hat{x}_1^{(n)}, \ldots, \hat{x}_n^{(n)};\, x) \to f(x) \quad \text{as } n \to \infty, \tag{2.97}$$

uniformly on $[-1,1]$, provided only that $f \in C^1[-1,1]$. Thus we do not need
analyticity of $f$ for (2.97) to hold.
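
The minimizing property of the Chebyshev nodes can be checked numerically. The following is a minimal sketch (the function names, the choice n = 10, and the sampling grid are our own assumptions, not part of the text): it estimates the norm (2.94) on a fine grid for the nodes (2.95) and for equally spaced nodes, and compares the former with the theoretical value 2^{-n}.

```python
import math

def chebyshev_nodes(n):
    """Zeros of T_{n+1}, formula (2.95): n + 1 nodes in (-1, 1)."""
    return [math.cos((2 * i + 1) * math.pi / (2 * n + 2)) for i in range(n + 1)]

def equispaced_nodes(n):
    """n + 1 equally spaced nodes on [-1, 1] (requires n >= 1)."""
    return [-1.0 + 2.0 * i / n for i in range(n + 1)]

def node_polynomial_norm(nodes, samples=20001):
    """Approximate the norm (2.94) by sampling |prod (x - x_i)| on a grid."""
    best = 0.0
    for k in range(samples):
        x = -1.0 + 2.0 * k / (samples - 1)
        prod = 1.0
        for xi in nodes:
            prod *= x - xi
        best = max(best, abs(prod))
    return best

n = 10
cheb = node_polynomial_norm(chebyshev_nodes(n))
equi = node_polynomial_norm(equispaced_nodes(n))
print(f"Chebyshev nodes:  {cheb:.6e}  (theory: 2^-n = {2.0 ** -n:.6e})")
print(f"equispaced nodes: {equi:.6e}")
```

The grid contains the endpoints, where the monic Chebyshev polynomial attains its extreme value, so the sampled maximum for the Chebyshev nodes agrees with 2^{-n} up to rounding.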

We finally remark, as already suggested by the recurrence relation (2.83), that
Chebyshev polynomials are a special case of orthogonal polynomials. Indeed, the
measure in question is precisely (up to an unimportant constant factor) the arc sine
measure

$$d\lambda(x) = \frac{dx}{\sqrt{1 - x^2}} \quad \text{on } [-1,1], \tag{2.98}$$

already mentioned in Example 2 of Sect. 2.1.4. This is easily verified from (2.81)
and the orthogonality of the cosines (cf. Sect. 2.1.4, (2.33)):

$$\int_{-1}^{1} T_k(x)\,T_\ell(x)\,\frac{dx}{\sqrt{1-x^2}} = \int_0^\pi T_k(\cos\theta)\,T_\ell(\cos\theta)\,d\theta
= \int_0^\pi \cos k\theta \,\cos \ell\theta \,d\theta =
\begin{cases} 0 & \text{if } k \ne \ell, \\ \pi & \text{if } k = \ell = 0, \\ \tfrac{1}{2}\pi & \text{if } k = \ell > 0. \end{cases} \tag{2.99}$$

The Fourier expansion in Chebyshev polynomials (essentially the Fourier cosine
expansion) is therefore given by

$$f(x) = {\sum_{j=0}^{\infty}}{}' \, c_j T_j(x) := \frac{1}{2} c_0 + \sum_{j=1}^{\infty} c_j T_j(x), \tag{2.100}$$

where

$$c_j = \frac{2}{\pi} \int_{-1}^{1} f(x)\,T_j(x)\,\frac{dx}{\sqrt{1-x^2}}, \quad j = 0,1,2,\ldots. \tag{2.101}$$


Truncating (2.100) with the term of degree $n$ gives a useful polynomial approximation
of degree $n$,

$$\tau_n(x) = {\sum_{j=0}^{n}}{}' \, c_j T_j(x),$$

having an error

$$f(x) - \tau_n(x) = \sum_{j=n+1}^{\infty} c_j T_j(x) \approx c_{n+1} T_{n+1}(x). \tag{2.102}$$

The approximation on the far right is better the faster the Fourier coefficients $c_j$
tend to zero. The error (2.102), therefore, essentially oscillates between $+c_{n+1}$ and
$-c_{n+1}$ as $x$ varies on the interval $[-1,1]$, and thus is of “uniform” size. This is in
stark contrast to Taylor’s expansion at $x = 0$, where the $n$th-degree partial sum has
an error proportional to $x^{n+1}$ on $[-1,1]$.
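
As an illustration of this contrast, the sketch below compares the truncated Chebyshev expansion with the Taylor polynomial of the same degree for f(x) = e^x. It rests on assumptions of our own: the coefficients (2.101) are approximated by the midpoint rule after the substitution x = cos θ, the number of quadrature points is chosen ad hoc, and all function names are hypothetical.

```python
import math

def chebyshev_coefficients(f, n, quad_points=64):
    """Approximate c_0..c_n of (2.101) via x = cos(theta) and the midpoint rule."""
    thetas = [(k + 0.5) * math.pi / quad_points for k in range(quad_points)]
    values = [f(math.cos(t)) for t in thetas]
    return [2.0 / quad_points * sum(v * math.cos(j * t) for v, t in zip(values, thetas))
            for j in range(n + 1)]

def chebyshev_partial_sum(coeffs, x):
    """tau_n(x) = c_0/2 + sum_{j>=1} c_j T_j(x), with T_j(x) = cos(j arccos x)."""
    theta = math.acos(max(-1.0, min(1.0, x)))
    return 0.5 * coeffs[0] + sum(c * math.cos(j * theta)
                                 for j, c in enumerate(coeffs) if j > 0)

def taylor_exp(n, x):
    """Taylor polynomial of degree n of exp at x = 0."""
    return sum(x ** j / math.factorial(j) for j in range(n + 1))

n = 4
coeffs = chebyshev_coefficients(math.exp, n)
grid = [-1.0 + 2.0 * k / 400 for k in range(401)]
cheb_err = max(abs(math.exp(x) - chebyshev_partial_sum(coeffs, x)) for x in grid)
taylor_err = max(abs(math.exp(x) - taylor_exp(n, x)) for x in grid)
print(f"max error, Chebyshev partial sum: {cheb_err:.3e}")
print(f"max error, Taylor polynomial:     {taylor_err:.3e}")
```

The Taylor error is concentrated near the endpoints, whereas the Chebyshev error is spread almost evenly over the interval and is considerably smaller in the maximum norm.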


2.2.5  Barycentric Formula 


Lagrange’s formula (2.52) is attractive more for theoretical purposes than for
practical computational work. It can be rewritten, however, in a form that makes
it efficient computationally, and that also allows additional interpolation nodes to
be added with ease. Having the latter feature in mind, we now assume a sequential
set $x_0, x_1, x_2, \ldots$ of interpolation nodes and denote by $p_n(f;\,\cdot)$ the polynomial of
degree $\le n$ interpolating to $f$ at the first $n+1$ of them. We do not assume that the
$x_i$ are in any particular order, as long as they are mutually distinct.

We introduce a triangular array of auxiliary quantities defined by

$$\lambda_0^{(0)} = 1, \qquad \lambda_i^{(n)} = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{1}{x_i - x_j}, \quad i = 0,1,\ldots,n; \quad n = 1,2,3,\ldots. \tag{2.103}$$

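The elementary Lagrange interpolation polynomials of degree $n$, (2.49), can then be
written in the form

$$\ell_i(x) = \frac{\lambda_i^{(n)}}{x - x_i}\,\omega_n(x), \quad i = 0,1,\ldots,n; \qquad \omega_n(x) = \prod_{j=0}^{n} (x - x_j). \tag{2.104}$$

Dividing Lagrange’s formula through by $1 \equiv \sum_{i=0}^{n} \ell_i(x)$, one finds

$$p_n(f;x) = \sum_{i=0}^{n} f_i\,\ell_i(x) = \frac{\sum_{i=0}^{n} f_i\,\ell_i(x)}{\sum_{i=0}^{n} \ell_i(x)}
= \frac{\sum_{i=0}^{n} f_i\,\dfrac{\lambda_i^{(n)}}{x - x_i}\,\omega_n(x)}{\sum_{i=0}^{n} \dfrac{\lambda_i^{(n)}}{x - x_i}\,\omega_n(x)},$$

that is,

$$p_n(f;x) = \frac{\sum_{i=0}^{n} \dfrac{\lambda_i^{(n)}}{x - x_i}\,f_i}{\sum_{i=0}^{n} \dfrac{\lambda_i^{(n)}}{x - x_i}}, \quad x \ne x_i \ \text{ for } i = 0,1,\ldots,n. \tag{2.105}$$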


This expresses the interpolation polynomial as a weighted average of the function
values $f_i = f(x_i)$ and is, therefore, called the barycentric formula (a slight
misnomer, since the weights are not necessarily all positive). The auxiliary quantities
$\lambda_i^{(n)}$ involved in (2.105) are those in the row numbered $n$ of the triangular array
(2.103). Once they have been calculated, the evaluation of $p_n(f;x)$ by (2.105), for
any fixed $x$, is straightforward and cheap. Note, however, that when $x$ is sufficiently
close to some $x_i$, the right-hand side of (2.105) should be replaced by $f_i$.
Comparison with (2.52) shows that

$$\ell_i(x) = \frac{\dfrac{\lambda_i^{(n)}}{x - x_i}}{\sum_{j=0}^{n} \dfrac{\lambda_j^{(n)}}{x - x_j}}, \quad i = 0,1,\ldots,n. \tag{2.106}$$


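In order to arrive at an efficient algorithm for computing the required quantities
$\lambda_i^{(n)}$, we first note that, for $k \ge 1$,

$$\lambda_i^{(k)} = \frac{\lambda_i^{(k-1)}}{x_i - x_k}, \quad i = 0,1,\ldots,k-1. \tag{2.107}$$

The last quantity $\lambda_k^{(k)}$ missing in (2.107) is best computed directly from the
definition (2.103),

$$\lambda_k^{(k)} = \frac{1}{\prod_{j=0}^{k-1} (x_k - x_j)}, \quad k \ge 1.$$

We thus arrive at the following algorithm:

$$\begin{aligned}
&\lambda_0^{(0)} = 1, \\
&\text{for } k = 1,2,\ldots,n \text{ do} \\
&\qquad \lambda_i^{(k)} = \frac{\lambda_i^{(k-1)}}{x_i - x_k}, \quad i = 0,1,\ldots,k-1, \\
&\qquad \lambda_k^{(k)} = \frac{1}{\prod_{j=0}^{k-1} (x_k - x_j)} \\
&\text{end}
\end{aligned} \tag{2.108}$$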

This requires $\frac{1}{2}n(n+1)$ subtractions, $\frac{1}{2}(n-1)n$ multiplications, and $\frac{1}{2}n(n+3)$
divisions for computing the $n+1$ quantities $\lambda_0^{(n)}, \lambda_1^{(n)}, \ldots, \lambda_n^{(n)}$ in (2.105). Therefore,
(2.106) in combination with (2.108) is more efficient than (2.49), which requires
$O(n^2)$ operations to evaluate. It is also quite stable, since only benign arithmetic
operations are involved (disregarding the formation of differences such as $x - x_i$,
which occur in both formulae).

If we decide to incorporate the next data point $(x_{n+1}, f_{n+1})$, all we need to do is
extend the $k$-loop in (2.108) through $n+1$, that is, generate the next row of auxiliary
quantities $\lambda_0^{(n+1)}, \lambda_1^{(n+1)}, \ldots, \lambda_{n+1}^{(n+1)}$. We are then ready to compute $p_{n+1}(f;x)$
from (2.105) with $n$ replaced by $n+1$.
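
The algorithm (2.108) and the evaluation formula (2.105) might be coded as in the following sketch. The function names and the sample data are our own; moreover, the sketch returns f_i only when x coincides exactly with a node, a simplification of the "sufficiently close" rule stated above.

```python
def barycentric_weights(nodes):
    """Row n of the array (2.103), generated by the recurrence (2.108)."""
    weights = [1.0]                        # lambda_0^(0) = 1
    for k in range(1, len(nodes)):
        prod = 1.0
        for i in range(k):
            diff = nodes[i] - nodes[k]
            weights[i] /= diff             # update (2.107)
            prod *= -diff                  # accumulates prod_j (x_k - x_j)
        weights.append(1.0 / prod)         # lambda_k^(k)
    return weights

def barycentric_eval(nodes, values, weights, x):
    """Evaluate p_n(f; x) by (2.105); at a node, return the data value."""
    numer = denom = 0.0
    for xi, fi, wi in zip(nodes, values, weights):
        if x == xi:
            return fi
        term = wi / (x - xi)
        numer += term * fi
        denom += term
    return numer / denom

nodes = [0.0, 1.0, 2.0, 4.0]
values = [3.0, 4.0, 7.0, 19.0]             # samples of 3 + x^2
weights = barycentric_weights(nodes)
print(weights)
print(barycentric_eval(nodes, values, weights, 3.0))
```

Appending a further node only requires one more pass of the outer loop, which is the updating feature described above.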


2.2.6 Newton’s¹¹ Formula


This is another way of organizing the work in Sect. 2.2.5. Although the compu- 
tational effort remains essentially the same, it becomes easier to treat “confluent” 
interpolation points, that is, multiple points in which not only the function values, 
but also consecutive derivative values, are given (cf. Sect. 2.2.7). 

Using the same setup as in Sect. 2.2.5, we denote

$$p_n(x) = p_n(f;\, x_0, x_1, \ldots, x_n;\, x), \quad n = 0,1,2,\ldots. \tag{2.109}$$

¹¹ Sir Isaac Newton (1643-1727) was an eminent figure of seventeenth century mathematics and
physics. Not only did he lay the foundations of modern physics, but he was also one of the
coinventors of differential calculus. Another was Leibniz, with whom he became entangled in
a bitter and life-long priority dispute. His most influential work was the Philosophiae Naturalis
Principia Mathematica, often called simply the Principia, one of the greatest works on physics
and astronomy ever written. Therein one finds not only his ideas on interpolation, but also
his suggestion to use the interpolating polynomial for purposes of integration (cf. Chap. 3,
Sect. 3.2.2).

We clearly have

$$p_0(x) = a_0, \qquad p_n(x) = p_{n-1}(x) + a_n (x - x_0)(x - x_1) \cdots (x - x_{n-1}), \quad n = 1,2,3,\ldots, \tag{2.110}$$

for some constants $a_0, a_1, a_2, \ldots$. This gives rise to a new form of the interpolation
polynomial,

$$p_n(f;x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + \cdots + a_n(x - x_0)(x - x_1) \cdots (x - x_{n-1}), \tag{2.111}$$


which is called Newton’s form. The constants involved can be determined, in
principle, by the interpolation conditions

$$\begin{aligned}
f_0 &= a_0, \\
f_1 &= a_0 + a_1(x_1 - x_0), \\
f_2 &= a_0 + a_1(x_2 - x_0) + a_2(x_2 - x_0)(x_2 - x_1),
\end{aligned}$$

and so on, which represent a triangular, nonsingular (why?) system of linear
algebraic equations. This uniquely determines the constants; for example,

$$a_0 = f_0, \qquad a_1 = \frac{f_1 - f_0}{x_1 - x_0}, \qquad a_2 = \frac{f_2 - a_0 - a_1(x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)},$$

and so on. Evidently, $a_n$ is a linear combination of $f_0, f_1, \ldots, f_n$, with coefficients
that depend on $x_0, x_1, \ldots, x_n$. We use the notation

$$a_n = [x_0, x_1, \ldots, x_n]f, \quad n = 0,1,2,\ldots, \tag{2.112}$$


for this linear combination, and call the right-hand side the $n$th divided difference of
$f$ relative to the nodes $x_0, x_1, \ldots, x_n$. Considered as a function of these $n+1$
variables, the divided difference is a symmetric function; that is, permuting the
variables in any way does not affect the value of the function. This is a direct
consequence of the fact that $a_n$ in (2.111) is the leading coefficient of $p_n(f;x)$:
the interpolation polynomial $p_n(f;\,\cdot)$ surely does not depend on the order in which
we write down the interpolation conditions.

The name “divided difference” comes from the useful property

$$[x_0, x_1, x_2, \ldots, x_k]f = \frac{[x_1, x_2, \ldots, x_k]f - [x_0, x_1, \ldots, x_{k-1}]f}{x_k - x_0}, \tag{2.113}$$

expressing the $k$th divided difference as a difference of $(k-1)$st divided differences,
divided by a difference of the $x_i$. Since we have symmetry, the order in which the
variables are written down is immaterial; what is important is that the two divided
differences (of the same order $k-1$) in the numerator have $k-1$ of the $x_i$ in
common. The “extra” one in the first term, and the “extra” one in the second, are
precisely the $x_i$ that appear in the denominator, in the same order.

To prove (2.113), let

$$r(x) = p_{k-1}(f;\, x_1, x_2, \ldots, x_k;\, x)$$

and

$$s(x) = p_{k-1}(f;\, x_0, x_1, \ldots, x_{k-1};\, x).$$

Then

$$p_k(f;\, x_0, x_1, \ldots, x_k;\, x) = r(x) + \frac{x - x_k}{x_k - x_0}\,[r(x) - s(x)]. \tag{2.114}$$

Indeed, the polynomial on the right-hand side has clearly degree $\le k$ and takes on
the correct value $f_i$ at $x_i$, $i = 0,1,\ldots,k$. For example, if $i \ne 0$ and $i \ne k$,

$$r(x_i) + \frac{x_i - x_k}{x_k - x_0}\,[r(x_i) - s(x_i)] = f_i + \frac{x_i - x_k}{x_k - x_0}\,[f_i - f_i] = f_i,$$

and similarly for $i = 0$ and for $i = k$. By uniqueness of the interpolation
polynomial, this implies (2.114). Now equating the leading coefficients on both
sides of (2.114) immediately gives (2.113).

Equation (2.113) can be used to generate the table of divided differences:

$$\begin{array}{c|cccc}
x & f & & & \\ \hline
x_0 & f_0 & & & \\
x_1 & f_1 & [x_0,x_1]f & & \\
x_2 & f_2 & [x_1,x_2]f & [x_0,x_1,x_2]f & \\
x_3 & f_3 & [x_2,x_3]f & [x_1,x_2,x_3]f & [x_0,x_1,x_2,x_3]f \\
\vdots & \vdots & \vdots & \vdots & \vdots
\end{array} \tag{2.115}$$

The divided differences are here arranged in such a manner that their computation
proceeds according to one single rule: each entry is the difference of the entry
immediately to the left and the one above it, divided by the difference of the $x$-value
horizontally to the left and the one opposite the $f$-value found by going diagonally
up. Each entry, therefore, is calculated from its two neighbors immediately to the
left, which is expressed by the computing stencil in (2.115).

The divided differences $a_0, a_1, \ldots, a_n$ (cf. (2.112)) that occur in Newton’s
formula (2.111) are precisely the first $n+1$ diagonal entries in the table of
divided differences. Their computation requires $n(n+1)$ additions and $\frac{1}{2}n(n+1)$
divisions, essentially the same effort that was required in computing the auxiliary
quantities $\lambda_i^{(k)}$ in the barycentric formula (cf. Ex. 61). Adding another data point
$(x_{n+1}, f_{n+1})$ requires the generation of the next line of divided differences. The
last entry of this line is $a_{n+1}$, and we can update $p_n(f;x)$ by adding to it the term
$a_{n+1}(x - x_0)(x - x_1) \cdots (x - x_n)$ to get $p_{n+1}$ (cf. (2.110)).


Example.

$$\begin{array}{c|cccc}
x & f & & & \\ \hline
0 & 3 & & & \\
1 & 4 & (4-3)/(1-0) = 1 & & \\
2 & 7 & (7-4)/(2-1) = 3 & (3-1)/(2-0) = 1 & \\
4 & 19 & (19-7)/(4-2) = 6 & (6-3)/(4-1) = 1 & (1-1)/(4-0) = 0
\end{array}$$

The cubic interpolation polynomial is

$$p_3(f;x) = 3 + 1\cdot(x-0) + 1\cdot(x-0)(x-1) + 0\cdot(x-0)(x-1)(x-2) = 3 + x + x(x-1) = 3 + x^2,$$

which indeed is the function tabulated. Note that the leading coefficient of $p_3(f;\,\cdot)$
is zero, which is why the last divided difference turned out to be 0.
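
The table and the evaluation of Newton's form can be sketched in a few lines. This is only an illustration under our own naming; the table is overwritten in place column by column so that only its diagonal survives, and Newton's form is evaluated by nested multiplication, a choice not prescribed by the text.

```python
def divided_differences(xs, fs):
    """Return the diagonal a_0..a_n of the table (2.115) for distinct nodes xs."""
    coeffs = list(fs)
    n = len(xs)
    for order in range(1, n):
        # Sweep bottom-up so that the lower-order entries are still available.
        for i in range(n - 1, order - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - order])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate Newton's form (2.111) by nested multiplication."""
    result = coeffs[-1]
    for a, xi in zip(reversed(coeffs[:-1]), reversed(xs[:len(coeffs) - 1])):
        result = result * (x - xi) + a
    return result

xs = [0.0, 1.0, 2.0, 4.0]
fs = [3.0, 4.0, 7.0, 19.0]
coeffs = divided_differences(xs, fs)
print(coeffs)                        # diagonal of the table in the Example
print(newton_eval(xs, coeffs, 3.0))  # value of 3 + x^2 at x = 3
```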

Newton’s formula also yields a new representation for the error term in Lagrange
interpolation. Let $t$ temporarily denote an arbitrary “node” not equal to any of the
$x_0, x_1, \ldots, x_n$. Then we have

$$p_{n+1}(f;\, x_0, x_1, \ldots, x_n, t;\, x) = p_n(f;x) + [x_0, x_1, \ldots, x_n, t]f \cdot \prod_{i=0}^{n} (x - x_i).$$

Now put $x = t$; since the polynomial on the left-hand side interpolates to $f$ at $t$,
we get

$$f(t) = p_n(f;t) + [x_0, x_1, \ldots, x_n, t]f \cdot \prod_{i=0}^{n} (t - x_i).$$

Writing again $x$ for $t$ (which was arbitrary, after all), we find

$$f(x) - p_n(f;x) = [x_0, x_1, \ldots, x_n, x]f \cdot \prod_{i=0}^{n} (x - x_i). \tag{2.116}$$

This is the new formula for the interpolation error. Note that it involves no derivative
of $f$, only function values. The trouble is that $f(x)$ is one of them. Indeed, (2.116)
is basically a tautology since, when everything is written out explicitly, the formula
evaporates to $0 = 0$, which is correct, but not overly exciting.

In spite of this seeming emptiness of (2.116), we can draw from it an interesting
and very useful conclusion. (For another application, see Chap. 3, Ex. 2.) Indeed,
compare it with the earlier formula (2.60); one obtains

$$[x_0, x_1, \ldots, x_n, x]f = \frac{f^{(n+1)}(\xi(x))}{(n+1)!},$$

where $x_0, x_1, \ldots, x_n, x$ are arbitrary distinct points in $[a,b]$ and $f \in C^{n+1}[a,b]$.
Moreover, $\xi(x)$ is strictly between the smallest and largest of these points (cf. the
proof of (2.60)). We can now write $x = x_{n+1}$, and then replace $n+1$ by $n$ to get

$$[x_0, x_1, \ldots, x_n]f = \frac{1}{n!}\,f^{(n)}(\xi). \tag{2.117}$$


Thus, for any $n+1$ distinct points in $[a,b]$ and any $f \in C^n[a,b]$, the divided
difference of $f$ of order $n$ is the $n$th scaled derivative of $f$ at some (unknown)
intermediate point. If we now let all $x_i$, $i \ge 1$, tend to $x_0$, then $\xi$, being trapped
between them, must also tend to $x_0$, and, since $f^{(n)}$ is continuous at $x_0$, we obtain

$$[\underbrace{x_0, x_0, \ldots, x_0}_{n+1 \text{ times}}]f = \frac{1}{n!}\,f^{(n)}(x_0). \tag{2.118}$$

This suggests that the $n$th divided difference at $n+1$ “confluent” (i.e., identical)
points be defined to be the $n$th derivative at this point divided by $n!$. This allows
us, in the next section, to solve the Hermite interpolation problem.


2.2.7 Hermite¹² Interpolation


The general Hermite interpolation problem consists of the following: given $K+1$
distinct points $x_0, x_1, \ldots, x_K$ in $[a,b]$ and corresponding integers $m_k \ge 1$, and
given a function $f \in C^{M-1}[a,b]$, with $M = \max_k m_k$, find a polynomial $p$ of
lowest degree such that, for $k = 0,1,\ldots,K$,

$$p^{(\mu)}(x_k) = f_k^{(\mu)}, \quad \mu = 0,1,\ldots,m_k - 1, \tag{2.119}$$

where $f_k^{(\mu)} = f^{(\mu)}(x_k)$ is the $\mu$th derivative of $f$ at $x_k$.


¹² Charles Hermite (1822-1901) was a leading French mathematician. An Academician in Paris,
known for his extensive work in number theory, algebra, and analysis, he is famous for his
proof in 1873 of the transcendental nature of the number $e$. He was also a mentor of the Dutch
mathematician Stieltjes.

The problem can be thought of as a limiting case of Lagrange interpolation if
we consider $x_k$ to be a point of multiplicity $m_k$, that is, obtained by a confluence
of $m_k$ distinct points into a single point $x_k$. We can imagine setting up the table of
divided differences, and Newton’s interpolation formula, just before the confluence
takes place, and then simply “go to the limit.” To do this in practice requires that
each point $x_k$ be entered exactly $m_k$ times in the first column of the table of divided
differences. The formula (2.118) then allows us to initialize the divided differences
for these points. For example, if $m_k = 4$, then

$$\begin{array}{c|cccc}
x & f & & & \\ \hline
x_k & f_k & & & \\
x_k & f_k & f_k' & & \\
x_k & f_k & f_k' & \tfrac{1}{2} f_k'' & \\
x_k & f_k & f_k' & \tfrac{1}{2} f_k'' & \tfrac{1}{6} f_k'''
\end{array} \tag{2.120}$$

Doing this initialization for each $k$, we are then ready to complete the table of
divided differences in the usual way. (There will be no zero divisors; they have been
taken care of during the initialization.) We obtain a table with $m_0 + m_1 + \cdots + m_K$
entries in the first column, and hence an interpolation polynomial of degree $\le n =
m_0 + m_1 + \cdots + m_K - 1$, which, as in the Lagrange case, is unique. The $n+1$
diagonal entries in the table give us the coefficients in Newton’s formula, as before,
except that in the product terms of the formula, some of the factors are repeated.
Also the error term of interpolation remains in force, with the repetition of factors
properly accounted for.

We illustrate the procedure with two simple examples. 


1. Find $p \in \mathbb{P}_3$ such that

$$p(x_0) = f_0, \quad p'(x_0) = f_0', \quad p''(x_0) = f_0'', \quad p'''(x_0) = f_0'''.$$

Here $K = 0$, $m_0 = 4$, that is, we have a single quadruple point. The table of
divided differences is precisely the one in (2.120) (with $k = 0$); hence Newton’s
formula becomes

$$p(x) = f_0 + (x - x_0) f_0' + \tfrac{1}{2}(x - x_0)^2 f_0'' + \tfrac{1}{6}(x - x_0)^3 f_0''',$$

which is nothing but the Taylor polynomial of degree 3. Thus Taylor’s polynomial
is a special case of a Hermite interpolation polynomial. The error term of
interpolation, furthermore, gives us

$$f(x) - p(x) = \tfrac{1}{4!}(x - x_0)^4 f^{(4)}(\xi), \quad \xi \text{ between } x_0 \text{ and } x,$$

which is Lagrange’s form of the remainder term in Taylor’s formula.

Fig. 2.8 A Hermite interpolation problem


2. Find $p \in \mathbb{P}_3$ such that

$$p(x_0) = f_0, \quad p(x_1) = f_1, \quad p'(x_1) = f_1', \quad p(x_2) = f_2,$$

where $x_0 < x_1 < x_2$ (cf. Fig. 2.8).
The table of divided differences now has the form:

$$\begin{array}{c|cccc}
x & f & & & \\ \hline
x_0 & f_0 & & & \\
x_1 & f_1 & [x_0,x_1]f & & \\
x_1 & f_1 & f_1' & [x_0,x_1,x_1]f & \\
x_2 & f_2 & [x_1,x_2]f & [x_1,x_1,x_2]f & [x_0,x_1,x_1,x_2]f
\end{array}$$

If we denote the diagonal entries, as before, by $a_0, a_1, a_2, a_3$, Newton’s
formula takes the form

$$p(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + a_3(x - x_0)(x - x_1)^2,$$

and the error formula becomes

$$f(x) - p(x) = (x - x_0)(x - x_1)^2(x - x_2)\,\frac{f^{(4)}(\xi)}{4!}, \quad x_0 < \xi < x_2.$$

For equally spaced points, say, $x_0 = x_1 - h$, $x_2 = x_1 + h$, we have, if $x = x_1 + th$,
$-1 \le t \le 1$,

$$|(x - x_0)(x - x_1)^2(x - x_2)| = |(t^2 - 1)\,t^2 \cdot h^4| \le \tfrac{1}{4} h^4,$$

and so

$$\|f - p\|_\infty \le \tfrac{1}{4} h^4 \cdot \frac{\|f^{(4)}\|_\infty}{24} = \frac{h^4}{96}\,\|f^{(4)}\|_\infty,$$

with the $\infty$-norm referring to the interval $[x_0, x_2]$.
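
The procedure of entering each point m_k times and initializing the confluent entries by (2.118) might be organized as in the following sketch. The data layout (a list of pairs of a node and its derivative values) and all names are our own assumptions; the test data are taken from f(x) = x^3 - 2x in the setting of the second example, for which the cubic interpolant reproduces f exactly.

```python
import math

def hermite_newton(points):
    """points: list of (x_k, [f(x_k), f'(x_k), ...]) with distinct x_k.

    Returns the repeated node list and the Newton coefficients a_0..a_n.
    """
    zs, derivs = [], []
    for x, d in points:
        zs += [x] * len(d)               # enter x_k exactly m_k times
        derivs += [d] * len(d)
    n = len(zs)
    column = [d[0] for d in derivs]      # order-0 column: function values
    coeffs = [column[0]]
    for order in range(1, n):
        new = []
        for i in range(order, n):
            if zs[i] == zs[i - order]:
                # confluent nodes: initialize by (2.118)
                new.append(derivs[i][order] / math.factorial(order))
            else:
                # the previous column stores row r at index r - (order - 1)
                new.append((column[i - order + 1] - column[i - order])
                           / (zs[i] - zs[i - order]))
        column = new
        coeffs.append(column[0])         # diagonal entry of this column
    return zs, coeffs

def newton_eval(zs, coeffs, x):
    """Newton's form with (possibly repeated) nodes, by nested multiplication."""
    result = coeffs[-1]
    for a, z in zip(reversed(coeffs[:-1]), reversed(zs[:len(coeffs) - 1])):
        result = result * (x - z) + a
    return result

# f(x) = x^3 - 2x with a simple node at 0, a double node at 1, a simple node at 3
zs, coeffs = hermite_newton([(0.0, [0.0]), (1.0, [-1.0, 1.0]), (3.0, [21.0])])
print(zs, coeffs)
print(newton_eval(zs, coeffs, 2.0))
```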


2.2.8 Inverse Interpolation 


An interesting application of interpolation, and in particular of Newton’s formula,
is to the solution of a nonlinear equation,

$$f(x) = 0. \tag{2.121}$$

Here $f$ is a given (nonlinear) function, and we are interested in a root $\alpha$ of the
equation for which we already have two approximations,

$$x_0 \approx \alpha, \quad x_1 \approx \alpha.$$

We assume further that near the root $\alpha$, the function $f$ is monotone, so that
$y = f(x)$ has an inverse $x = f^{-1}(y)$.

Denote, for short,

$$g(y) = f^{-1}(y).$$

Since $\alpha = g(0)$, our problem is to evaluate $g(0)$. From our two approximations, we
can compute $y_0 = f(x_0)$ and $y_1 = f(x_1)$, giving $x_0 = g(y_0)$, $x_1 = g(y_1)$. Hence,
we can start a table of divided differences for the inverse function $g$:

$$\begin{array}{c|cc}
y & g & \\ \hline
y_0 & x_0 & \\
y_1 & x_1 & [y_0,y_1]g
\end{array}$$

Wanting to compute $g(0)$, we can get a first improved approximation by linear
interpolation,

$$x_2 = x_0 + (0 - y_0)[y_0,y_1]g = x_0 - y_0[y_0,y_1]g.$$

Now evaluating $y_2 = f(x_2)$, we get $x_2 = g(y_2)$. Hence, the table of divided
differences can be updated and becomes

$$\begin{array}{c|ccc}
y & g & & \\ \hline
y_0 & x_0 & & \\
y_1 & x_1 & [y_0,y_1]g & \\
y_2 & x_2 & [y_1,y_2]g & [y_0,y_1,y_2]g
\end{array}$$

This allows us to use quadratic interpolation to get, again with Newton’s formula,

$$x_3 = x_2 + (0 - y_0)(0 - y_1)[y_0,y_1,y_2]g = x_2 + y_0 y_1 [y_0,y_1,y_2]g$$

and then

$$y_3 = f(x_3), \quad \text{and} \quad x_3 = g(y_3).$$


Since $y_0$, $y_1$ are small, the product $y_0 y_1$ is even smaller, making the correction term
added to the linear interpolant $x_2$ quite small. If necessary, we can continue updating
the difference table,

$$\begin{array}{c|cccc}
y & g & & & \\ \hline
y_0 & x_0 & & & \\
y_1 & x_1 & [y_0,y_1]g & & \\
y_2 & x_2 & [y_1,y_2]g & [y_0,y_1,y_2]g & \\
y_3 & x_3 & [y_2,y_3]g & [y_1,y_2,y_3]g & [y_0,y_1,y_2,y_3]g
\end{array}$$

and computing

$$x_4 = x_3 - y_0 y_1 y_2 [y_0,y_1,y_2,y_3]g, \quad y_4 = f(x_4), \quad x_4 = g(y_4),$$

giving us another data point to generate the next row of divided differences, and so
on. In general, the process will converge rapidly: $x_k \to \alpha$ as $k \to \infty$. The precise
analysis of convergence, however, is not simple because of the complicated structure
of the successive derivatives of the inverse function $g = f^{-1}$.
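
A compact sketch of this process is given below. The function name, the fixed number of steps, the stopping guard against repeated y-values, and the test equation x^3 - 2 = 0 are all our own assumptions. Only the most recent row of the table is stored, since each new row is generated from its predecessor.

```python
def inverse_interpolation(f, x0, x1, steps=5):
    """Approximate a root of f by Newton interpolation of g = f^{-1} at y = 0.

    Produces x_2, ..., x_{steps+1}; assumes f is monotone near the root,
    so that the values y_k = f(x_k) are mutually distinct.
    """
    ys = [f(x0), f(x1)]
    row = [x0]             # row r holds [y_{r-k},...,y_r]g for k = 0..r
    newest_x = x1
    approx = x0            # Newton sum for g, evaluated at y = 0
    factor = 1.0           # running product (0 - y_0)(0 - y_1)...
    for r in range(1, steps + 1):
        new_row = [newest_x]
        for k in range(1, r + 1):
            new_row.append((new_row[k - 1] - row[k - 1]) / (ys[r] - ys[r - k]))
        row = new_row
        factor *= -ys[r - 1]
        approx += row[r] * factor      # this is x_{r+1}
        y_new = f(approx)
        if y_new in ys:                # avoid a zero divisor in the next row
            break
        ys.append(y_new)
        newest_x = approx
    return approx

root = inverse_interpolation(lambda x: x ** 3 - 2.0, 1.0, 1.5)
print(root, 2.0 ** (1.0 / 3.0))
```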


2.3 Approximation and Interpolation by Spline Functions 


Our concern in Sect. 2.1.1 was with approximation of functions by a single
polynomial over a finite interval $[a,b]$. When more accuracy was wanted, we
simply increased the degree of the polynomial, and under suitable assumptions the
approximation indeed can be made as accurate as one wishes by choosing the degree
of the approximating polynomial sufficiently large.

However, there are other ways to control accuracy. One is to impose a subdivision
$\Delta$ upon the interval $[a,b]$,

$$\Delta: \quad a = x_1 < x_2 < x_3 < \cdots < x_{n-1} < x_n = b, \tag{2.122}$$

and use low-degree polynomials on each subinterval $[x_i, x_{i+1}]$ ($i = 1,2,\ldots,n-1$)
to approximate the given function. The rationale behind this is the recognition
that on a sufficiently small interval, functions can be approximated arbitrarily
well by polynomials of low degree, even degree 1, or zero, for that matter. Thus,
measuring the “fineness” of the subdivision $\Delta$ by

$$|\Delta| = \max_{1 \le i \le n-1} \Delta x_i, \quad \Delta x_i = x_{i+1} - x_i, \tag{2.123}$$

we try to control (increase) the accuracy by varying (decreasing) $|\Delta|$, keeping the
degrees of the polynomial pieces uniformly low.

To discuss these approximation processes, we make use of the class of functions
(cf. Example 2 at the beginning of Chap. 2)

$$S_m^k(\Delta) = \left\{ s : \ s \in C^k[a,b], \ s\big|_{[x_i, x_{i+1}]} \in \mathbb{P}_m, \ i = 1,2,\ldots,n-1 \right\}, \tag{2.124}$$

where $m \ge 0$, $k \ge 0$ are given nonnegative integers. We refer to $S_m^k(\Delta)$ as the spline
functions of degree $m$ and smoothness class $k$ relative to the subdivision $\Delta$. (If the
subdivision is understood from the context, we omit $\Delta$ in the notation on the
left-hand side of (2.124).) The point in the continuity assumption of (2.124), of course,
is that the $k$th derivative of $s$ is to be continuous everywhere on $[a,b]$, in particular,
also at the subdivision points $x_i$ ($i = 2,\ldots,n-1$) of $\Delta$. One extreme case is
$k = m$, in which case $s \in S_m^m$ necessarily consists of just one single polynomial of
degree $m$ on the whole interval $[a,b]$; that is, $S_m^m = \mathbb{P}_m$ (see Ex. 68). Since we want
to get away from $\mathbb{P}_m$, we assume $k < m$. The other extreme is the case where no
continuity at all (at the subdivision points $x_i$) is required; we then put $k = -1$. Thus
$S_m^{-1}(\Delta)$ is the class of piecewise polynomials of degree $\le m$, where the polynomial
pieces can be completely disjoint (see Fig. 2.9).

We begin with the simplest case, piecewise linear approximation, that is, the
case $m = 1$ (hence $k = 0$).


2.3.1 Interpolation by Piecewise Linear Functions 


The problem here is to find an $s \in S_1^0(\Delta)$ such that, for a given function $f$ defined
on $[a,b]$, we have

$$s(x_i) = f_i \quad \text{where } f_i = f(x_i), \quad i = 1,2,\ldots,n. \tag{2.125}$$

Fig. 2.9 A function $s \in S_m^{-1}$

Fig. 2.10 Piecewise linear interpolation


We conveniently let the interpolation nodes coincide with the points $x_i$ of the
subdivision $\Delta$ in (2.122). This simplifies matters, but is not necessary (cf. Ex. 75).
The solution then indeed is trivial; see Fig. 2.10. If we denote the (obviously unique)
interpolant by $s(\cdot) = s_1(f;\,\cdot)$, then the formula of linear interpolation gives

$$s_1(f;x) = f_i + (x - x_i)[x_i, x_{i+1}]f \quad \text{for } x_i \le x \le x_{i+1}, \quad i = 1,2,\ldots,n-1. \tag{2.126}$$


A bit more interesting is the analysis of the error. This, too, however, is quite
straightforward, once we note that $s_1(f;\,\cdot)$ on $[x_i, x_{i+1}]$ is simply the linear
interpolant to $f$. Thus, from the theory of (linear) interpolation,

$$f(x) - s_1(f;x) = (x - x_i)(x - x_{i+1})[x_i, x_{i+1}, x]f \quad \text{for } x \in [x_i, x_{i+1}];$$

hence, if $f \in C^2[a,b]$,

$$|f(x) - s_1(f;x)| \le \frac{(\Delta x_i)^2}{8} \max_{[x_i, x_{i+1}]} |f''|, \quad x \in [x_i, x_{i+1}].$$

It then follows immediately that

$$\|f(\cdot) - s_1(f;\,\cdot)\|_\infty \le \tfrac{1}{8}\,|\Delta|^2\,\|f''\|_\infty, \tag{2.127}$$

where the maximum norms are those on $[a,b]$; that is, $\|g\|_\infty = \max_{[a,b]} |g|$. This
shows that the error indeed can be made arbitrarily small, uniformly on $[a,b]$, by
taking $|\Delta|$ sufficiently small. Making $|\Delta|$ smaller, of course, increases the number
of polynomial pieces, and with it, the volume of data.

It is easy to show (see Ex. 80(b)) that

$$\mathrm{dist}_\infty(f, S_1^0) \le \|f(\cdot) - s_1(f;\,\cdot)\|_\infty \le 2\,\mathrm{dist}_\infty(f, S_1^0), \tag{2.128}$$

where, for any set of functions $S$,

$$\mathrm{dist}_\infty(f, S) := \inf_{s \in S} \|f - s\|_\infty.$$

In other words, the piecewise linear interpolant $s_1(f;\,\cdot)$ is a nearly optimal
approximation, its error differing from the error of the best approximant to $f$ from
$S_1^0$ by at most a factor of 2.
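
The interpolant (2.126) and the bound (2.127) are easily tried out. The sketch below is our own illustration (function names, the test function sin on [0, π], the node count, and the sampling grid are assumptions); nodes are indexed from 0 in the code, whereas the text counts from 1.

```python
import bisect
import math

def piecewise_linear(xs, fs, x):
    """s_1(f; x) from (2.126); xs strictly increasing, x in [xs[0], xs[-1]]."""
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    slope = (fs[i + 1] - fs[i]) / (xs[i + 1] - xs[i])    # [x_i, x_{i+1}]f
    return fs[i] + (x - xs[i]) * slope

n = 21
xs = [math.pi * i / (n - 1) for i in range(n)]
fs = [math.sin(x) for x in xs]
mesh = max(b - a for a, b in zip(xs, xs[1:]))            # |Delta| of (2.123)
grid = [math.pi * k / 2000 for k in range(2001)]
err = max(abs(math.sin(x) - piecewise_linear(xs, fs, x)) for x in grid)
bound = mesh ** 2 / 8.0       # (2.127), using ||f''||_inf = 1 for f = sin
print(f"observed error {err:.3e}, bound {bound:.3e}")
```

For this function the bound is nearly attained, because |f''| is close to its maximum on the subintervals adjacent to π/2.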


2.3.2 A Basis for $S_1^0(\Delta)$


What is the dimension of the space $S_1^0(\Delta)$? In other words, how many degrees of
freedom do we have? If, for the moment, we ignore the continuity requirement (i.e.,
if we look at $S_1^{-1}(\Delta)$), then each linear piece has two degrees of freedom, and there
are $n-1$ pieces; so $\dim S_1^{-1}(\Delta) = 2n-2$. Each continuity requirement imposes
one equation, and hence reduces the degree of freedom by 1. Since continuity must
be enforced only at the interior subdivision points $x_i$, $i = 2,\ldots,n-1$, we find that
$\dim S_1^0(\Delta) = 2n - 2 - (n-2) = n$. So we expect that a basis of $S_1^0(\Delta)$ must consist
of exactly $n$ basis functions.

We now define $n$ such functions. For notational convenience, we let $x_0 = x_1$ and
$x_{n+1} = x_n$; then, for $i = 1,2,\ldots,n$, we define

$$B_i(x) = \begin{cases}
\dfrac{x - x_{i-1}}{x_i - x_{i-1}} & \text{if } x_{i-1} \le x \le x_i, \\[2ex]
\dfrac{x_{i+1} - x}{x_{i+1} - x_i} & \text{if } x_i \le x \le x_{i+1}, \\[2ex]
0 & \text{otherwise.}
\end{cases} \tag{2.129}$$

Fig. 2.11 The functions $B_i$

Note that the first equation, when $i = 1$, and the second, when $i = n$, are to be
ignored, since $x$ in both cases is restricted to a single point and the ratio in question
has the meaningless form 0/0. (It is the other ratio that provides the necessary
information in these cases.) The functions $B_i$ may be referred to as “hat functions”
(Chinese hats), but note that the first and last hat is cut in half. The functions $B_i$ are
depicted in Fig. 2.11. We expect these functions to form a basis of $S_1^0(\Delta)$. To prove
this, we must show:


(a) the functions $\{B_i\}_{i=1}^{n}$ are linearly independent and
(b) they span the space $S_1^0(\Delta)$.

Both these properties follow from the basic fact that

$$B_i(x_j) = \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j, \end{cases} \tag{2.130}$$

which one easily reads from Fig. 2.11. To show (a), assume there is a linear
combination of the $B_i$ that vanishes identically on $[a,b]$,

$$s(x) = \sum_{i=1}^{n} c_i B_i(x), \quad s(x) \equiv 0 \ \text{ on } [a,b]. \tag{2.131}$$

Putting $x = x_j$ in (2.131) and using (2.130) then gives $c_j = 0$. Since this holds
for each $j = 1,2,\ldots,n$, we see that only the trivial linear combination (with all
$c_i = 0$) can vanish identically. To prove (b), let $s \in S_1^0(\Delta)$ be given arbitrarily. We
must show that $s$ can be represented as a linear combination of the $B_i$. We claim
that, indeed,

$$s(x) = \sum_{i=1}^{n} s(x_i) B_i(x). \tag{2.132}$$

This is so, because the function on the right-hand side has the same values as $s$ at
each $x_i$, and therefore, being in $S_1^0(\Delta)$, must coincide with $s$.

Equation (2.132), which holds for every $s \in S_1^0(\Delta)$, may be thought of as the
analogue of the Lagrange interpolation formula for polynomials. The role of the
elementary Lagrange polynomials $\ell_i$ is now played by the $B_i$.
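
For concreteness, the hat functions (2.129) and the representation (2.132) can be written down directly, as in the following sketch. The names and the sample subdivision are our own, and indices run from 0 in the code rather than from 1 as in the text; the half hats at the ends are obtained by skipping the branch that would produce 0/0.

```python
def hat(xs, i, x):
    """B_i(x) of (2.129) for the increasing node list xs (0-based index i)."""
    if i > 0 and xs[i - 1] <= x <= xs[i]:
        return (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    if i < len(xs) - 1 and xs[i] <= x <= xs[i + 1]:
        return (xs[i + 1] - x) / (xs[i + 1] - xs[i])
    return 0.0

def spline_from_values(xs, values, x):
    """Right-hand side of (2.132): sum over i of s(x_i) B_i(x)."""
    return sum(v * hat(xs, i, x) for i, v in enumerate(values))

xs = [0.0, 0.5, 2.0, 3.0]
# Kronecker property (2.130): prints the 4-by-4 identity matrix
for i in range(len(xs)):
    print([hat(xs, i, xj) for xj in xs])
print(spline_from_values(xs, [1.0, 3.0, 0.0, 2.0], 1.25))
```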


2.3.3 Least Squares Approximation 


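As an application of the basis $\{B_i\}$, we consider the problem of least squares
approximation on $[a,b]$ by functions in $S_1^0(\Delta)$. The discrete $L_2$ approximation
problem with data given at the points $x_i$ ($i = 1,2,\ldots,n$), of course, has the trivial
solution $s_1(f;\,\cdot)$, which drives the error to zero at each data point. We therefore
consider only the continuous problem: given $f \in C[a,b]$, find $\hat{s}_1(f;\,\cdot) \in S_1^0(\Delta)$
such that

$$\int_a^b [f(x) - \hat{s}_1(f;x)]^2\,dx \le \int_a^b [f(x) - s(x)]^2\,dx \quad \text{for all } s \in S_1^0(\Delta). \tag{2.133}$$

Writing

$$\hat{s}_1(f;x) = \sum_{i=1}^{n} \hat{c}_i B_i(x), \tag{2.134}$$

we know from the general theory of Sect. 2.1 that the coefficients $\hat{c}_i$ must satisfy the
normal equations

$$\sum_{j=1}^{n} \left[ \int_a^b B_i(x) B_j(x)\,dx \right] \hat{c}_j = \int_a^b B_i(x) f(x)\,dx, \quad i = 1,2,\ldots,n. \tag{2.135}$$

Now the fact that $B_i$ is nonzero only on $(x_{i-1}, x_{i+1})$ implies that
$\int_a^b B_i(x) \cdot B_j(x)\,dx = 0$ if $|i - j| > 1$; that is, the system (2.135) is tridiagonal. An easy
computation (cf. Ex. 77) indeed yields

$$\tfrac{1}{6}\,\Delta x_{i-1}\,\hat{c}_{i-1} + \tfrac{1}{3}(\Delta x_{i-1} + \Delta x_i)\,\hat{c}_i + \tfrac{1}{6}\,\Delta x_i\,\hat{c}_{i+1} = b_i, \quad i = 1,2,\ldots,n, \tag{2.136}$$

where $b_i = \int_a^b B_i(x) f(x)\,dx = \int_{x_{i-1}}^{x_{i+1}} B_i(x) f(x)\,dx$. Note, by our convention, that
$\Delta x_0 = 0$ and $\Delta x_n = 0$, so that (2.136) is in fact a tridiagonal system for the
unknowns $\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_n$. Its matrix is given by

$$\begin{bmatrix}
\tfrac{1}{3}\Delta x_1 & \tfrac{1}{6}\Delta x_1 & & & 0 \\
\tfrac{1}{6}\Delta x_1 & \tfrac{1}{3}(\Delta x_1 + \Delta x_2) & \tfrac{1}{6}\Delta x_2 & & \\
 & \tfrac{1}{6}\Delta x_2 & \ddots & \ddots & \\
 & & \ddots & \ddots & \tfrac{1}{6}\Delta x_{n-1} \\
0 & & & \tfrac{1}{6}\Delta x_{n-1} & \tfrac{1}{3}\Delta x_{n-1}
\end{bmatrix}.$$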


As it must be, by the general theory of Sect. 2.1, the matrix is symmetric
and positive definite, but it is also diagonally dominant, each diagonal element
exceeding by a factor of 2 the sum of the (positive) off-diagonal elements in the
same row. The system (2.136) can therefore be solved easily, rapidly, and accurately
by the Gauss elimination procedure, and there is no need for pivoting.

Like the interpolant $s_1(f;\,\cdot)$, the least squares approximant $\hat{s}_1(f;\,\cdot)$, too, can be
shown to be nearly optimal, in that

$$\mathrm{dist}_\infty(f, S_1^0) \le \|f(\cdot) - \hat{s}_1(f;\,\cdot)\|_\infty \le 4\,\mathrm{dist}_\infty(f, S_1^0). \tag{2.137}$$

The spread is now by a factor of 4, rather than 2, as in (2.128).
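
The system (2.136) might be assembled and solved as in the sketch below. Two points are assumptions of ours and not part of the text: the integrals b_i are approximated by Simpson's rule on each subinterval (exact only when f is piecewise linear on the subdivision, or more generally when B_i f is at most cubic on each piece), and the function name and test data are hypothetical. Elimination is carried out without pivoting, as justified above.

```python
def hat_least_squares(xs, f):
    """Solve the tridiagonal system (2.136) for the coefficients (0-based)."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    diag = [0.0] * n
    rhs = [0.0] * n
    for i in range(n - 1):
        a, b = xs[i], xs[i + 1]
        fmid = f(0.5 * (a + b))
        diag[i] += h[i] / 3.0
        diag[i + 1] += h[i] / 3.0
        # Simpson's rule on [a, b]: B_i falls from 1 to 0, B_{i+1} rises
        # from 0 to 1, and both equal 1/2 at the midpoint.
        rhs[i] += h[i] / 6.0 * (f(a) + 2.0 * fmid)
        rhs[i + 1] += h[i] / 6.0 * (2.0 * fmid + f(b))
    off = [hi / 6.0 for hi in h]
    # Gauss elimination without pivoting for the tridiagonal system
    for i in range(1, n):
        factor = off[i - 1] / diag[i - 1]
        diag[i] -= factor * off[i - 1]
        rhs[i] -= factor * rhs[i - 1]
    coeffs = [0.0] * n
    coeffs[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        coeffs[i] = (rhs[i] - off[i] * coeffs[i + 1]) / diag[i]
    return coeffs

xs = [0.0, 0.5, 2.0, 3.0]
coeffs = hat_least_squares(xs, lambda x: 2.0 * x + 1.0)
print(coeffs)
```

Since a linear function already belongs to the spline space, its least squares approximant is the function itself, so the computed coefficients should reproduce the nodal values 1, 2, 5, 7 up to rounding.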


2.3.4 Interpolation by Cubic Splines 


The most widely used splines are cubic splines, in particular, cubic spline
interpolants. We first discuss the interpolation problem for splines $s \in S_3^1(\Delta)$. Continuity
of the first derivative of any cubic spline interpolant $s_3(f;\,\cdot)$ can be enforced by
prescribing the values of the first derivative at each point $x_i$, $i = 1,2,\ldots,n$. Thus
let $m_1, m_2, \ldots, m_n$ be arbitrary given numbers, and denote

$$s_3(f;\,\cdot)\big|_{[x_i, x_{i+1}]} = p_i(x), \quad i = 1,2,\ldots,n-1. \tag{2.138}$$

Then, we enforce $s_3'(f;x_i) = m_i$, $i = 1,2,\ldots,n$, by selecting each piece $p_i$ of
$s_3(f;\,\cdot)$ to be the (unique) solution of a Hermite interpolation problem, namely,

$$\begin{aligned}
p_i(x_i) &= f_i, & p_i(x_{i+1}) &= f_{i+1}, \\
p_i'(x_i) &= m_i, & p_i'(x_{i+1}) &= m_{i+1},
\end{aligned} \qquad i = 1,2,\ldots,n-1. \tag{2.139}$$

We solve (2.139) by Newton’s interpolation formula. The required divided
differences are:

$$\begin{array}{c|cccc}
x & f & & & \\ \hline
x_i & f_i & & & \\
x_i & f_i & m_i & & \\
x_{i+1} & f_{i+1} & [x_i,x_{i+1}]f & \dfrac{[x_i,x_{i+1}]f - m_i}{\Delta x_i} & \\
x_{i+1} & f_{i+1} & m_{i+1} & \dfrac{m_{i+1} - [x_i,x_{i+1}]f}{\Delta x_i} & \dfrac{m_{i+1} + m_i - 2[x_i,x_{i+1}]f}{(\Delta x_i)^2}
\end{array}$$

and the interpolation polynomial (in Newton’s form) is

$$p_i(x) = f_i + (x - x_i)\,m_i + (x - x_i)^2\,\frac{[x_i,x_{i+1}]f - m_i}{\Delta x_i}
+ (x - x_i)^2 (x - x_{i+1})\,\frac{m_{i+1} + m_i - 2[x_i,x_{i+1}]f}{(\Delta x_i)^2}.$$


Alternatively, in Taylor’s form, we can write

$$p_i(x) = c_{i,0} + c_{i,1}(x - x_i) + c_{i,2}(x - x_i)^2 + c_{i,3}(x - x_i)^3, \quad x_i \le x \le x_{i+1}, \tag{2.140}$$

where, by noting that $x - x_{i+1} = x - x_i - \Delta x_i$,

$$\begin{aligned}
c_{i,0} &= f_i, \\
c_{i,1} &= m_i, \\
c_{i,2} &= \frac{[x_i,x_{i+1}]f - m_i}{\Delta x_i} - c_{i,3}\,\Delta x_i, \\
c_{i,3} &= \frac{m_{i+1} + m_i - 2[x_i,x_{i+1}]f}{(\Delta x_i)^2}.
\end{aligned} \tag{2.141}$$

Thus to compute $s_3(f;x)$ for any given $x \in [a,b]$ that is not an interpolation
node, one first locates the interval $[x_i, x_{i+1}]$ containing $x$ and then computes the
corresponding piece (2.138) by (2.140) and (2.141).
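
The evaluation just described can be sketched as follows, for slopes supplied by the caller. The function names, the 0-based indexing, and the test data are our own assumptions; the slopes in the demonstration are exact derivative values of sin, which is only one of several possible ways of choosing them.

```python
import bisect
import math

def cubic_piece_coefficients(x0, x1, f0, f1, m0, m1):
    """Taylor coefficients (2.141) of the Hermite cubic on [x0, x1]."""
    dx = x1 - x0
    slope = (f1 - f0) / dx                     # [x_i, x_{i+1}]f
    c3 = (m1 + m0 - 2.0 * slope) / dx ** 2
    c2 = (slope - m0) / dx - c3 * dx
    return f0, m0, c2, c3

def piecewise_cubic(xs, fs, ms, x):
    """Evaluate s_3(f; x) by (2.140), given slopes ms at the nodes xs."""
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    c0, c1, c2, c3 = cubic_piece_coefficients(
        xs[i], xs[i + 1], fs[i], fs[i + 1], ms[i], ms[i + 1])
    t = x - xs[i]
    return c0 + t * (c1 + t * (c2 + t * c3))

n = 11
xs = [math.pi * i / (n - 1) for i in range(n)]
fs = [math.sin(x) for x in xs]
ms = [math.cos(x) for x in xs]                 # slopes taken from f' = cos
grid = [math.pi * k / 2000 for k in range(2001)]
err = max(abs(math.sin(x) - piecewise_cubic(xs, fs, ms, x)) for x in grid)
print(f"max error on the grid: {err:.3e}")
```

With exact slopes the pieces reproduce any cubic polynomial, which gives a simple correctness check for the coefficient formulas.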

We now discuss some possible choices of the parameters $m_1, m_2, \ldots, m_n$.


(a) Piecewise cubic Hermite interpolation. Here one selects $m_i = f'(x_i)$, assuming
that these derivative values are known. This gives rise to a strictly local
scheme, in that each piece $p_i$ can be determined independently from the others.
Furthermore, the error of interpolation is easily estimated, since from the theory
of interpolation,

$$f(x) - p_i(x) = (x - x_i)^2 (x - x_{i+1})^2 [x_i, x_i, x_{i+1}, x_{i+1}, x]f, \quad x_i \le x \le x_{i+1};$$

hence, if $f \in C^4[a,b]$,

$$|f(x) - p_i(x)| \le \left( \tfrac{1}{2}\Delta x_i \right)^4 \max_{[x_i, x_{i+1}]} \frac{|f^{(4)}|}{4!}, \quad x_i \le x \le x_{i+1}.$$

There follows:

$$\|f(\cdot) - s_3(f;\,\cdot)\|_\infty \le \tfrac{1}{384}\,|\Delta|^4\,\|f^{(4)}\|_\infty. \tag{2.142}$$

In the case of equally spaced points $x_i$, one has $|\Delta| = (b-a)/(n-1)$ and,
therefore,

$$\|f(\cdot) - s_3(f;\,\cdot)\|_\infty = O(n^{-4}) \quad \text{as } n \to \infty. \tag{2.143}$$

This is quite satisfactory, but note that the derivative of $f$ must be known at
each point $x_i$, and the interpolant is only in $C^1[a,b]$.

As to the derivative values, one could approximate them by the derivatives
of $p_2(f;\, x_{i-1}, x_i, x_{i+1};\, x)$ at $x = x_i$, which requires only function values of
$f$, except at the endpoints, where again the derivatives of $f$ are involved, the
points $a = x_0 = x_1$ and $b = x_n = x_{n+1}$ being double points (cf. Ex. 78). It can
be shown that this degrades the accuracy to $O(|\Delta|^3)$.

(b) Cubic spline interpolation. Here we require $s_3(f;\,\cdot) \in S_3^2(\Delta)$, that is,
continuity of the second derivative. In terms of the pieces (2.138) of $s_3(f;\,\cdot)$,
this means that

$$p_{i-1}''(x_i) = p_i''(x_i), \quad i = 2,3,\ldots,n-1, \tag{2.144}$$

and translates into a condition for the Taylor coefficients in (2.140), namely,

$$2c_{i-1,2} + 6c_{i-1,3}\,\Delta x_{i-1} = 2c_{i,2}, \quad i = 2,3,\ldots,n-1.$$

Plugging in the explicit values (2.141) for these coefficients, we arrive at the
linear system

$$(\Delta x_i)\,m_{i-1} + 2(\Delta x_{i-1} + \Delta x_i)\,m_i + (\Delta x_{i-1})\,m_{i+1} = b_i, \quad i = 2,3,\ldots,n-1, \tag{2.145}$$

where

$$b_i = 3\{ (\Delta x_i)[x_{i-1},x_i]f + (\Delta x_{i-1})[x_i,x_{i+1}]f \}. \tag{2.146}$$

These are $n-2$ linear equations in the $n$ unknowns $m_1, m_2, \ldots, m_n$. Once $m_1$
and $m_n$ have been chosen in some way, the system again becomes tridiagonal
in the remaining unknowns and hence is readily solved by Gauss elimination.
Here are some possible choices of $m_1$ and $m_n$.


(b.1) Complete splines: $m_1 = f'(a)$, $m_n = f'(b)$. It is known that for this spline

$$\|f^{(r)}(\cdot) - s^{(r)}_{\mathrm{compl}}(f;\cdot)\|_\infty \le c_r\,|\Delta|^{4-r}\,\|f^{(4)}\|_\infty, \quad r = 0, 1, 2, 3, \quad \text{if } f \in C^4[a,b], \qquad (2.147)$$

where $c_0 = \frac{5}{384}$, $c_1 = \frac{1}{24}$, $c_2 = \frac{3}{8}$, and $c_3$ is a constant depending on the mesh
ratio $|\Delta| / \min_i \Delta x_i$. Rather remarkably, the bound for $r = 0$ is only five times larger
than the bound (2.142) for the piecewise cubic Hermite interpolant, which
requires derivative values of $f$ at all interpolation nodes $x_i$, not just at the
endpoints $a$ and $b$.

(b.2) Matching of the second derivatives at the endpoints: $s_3''(f;a) = f''(a)$,
$s_3''(f;b) = f''(b)$. Each of these conditions gives rise to an additional
equation, namely,

$$2m_1 + m_2 = 3[x_1, x_2]f - \tfrac{1}{2} f''(a)\,\Delta x_1,$$
$$m_{n-1} + 2m_n = 3[x_{n-1}, x_n]f + \tfrac{1}{2} f''(b)\,\Delta x_{n-1}. \qquad (2.148)$$

One conveniently adjoins the first equation to the top of the system (2.145),
and the second to the bottom, thereby preserving the tridiagonal structure of
the system.

(b.3) Natural cubic spline: $s''(f;a) = s''(f;b) = 0$. This again produces two
additional equations, which can be obtained from (2.148) by putting there
$f''(a) = f''(b) = 0$. They are adjoined to the system (2.145) as described in
(b.2). The nice thing about this spline is that it requires only function values
of $f$ (no derivatives!), but the price one pays is a degradation of the accuracy
to $O(|\Delta|^2)$ near the endpoints (unless indeed $f''(a) = f''(b) = 0$).

(b.4) “Not-a-knot spline” (C. de Boor): here we require $p_1(x) \equiv p_2(x)$ and
$p_{n-2}(x) \equiv p_{n-1}(x)$; that is, the first two pieces of the spline should be the
same polynomial, and similarly for the last two pieces. In effect, this means
that the first interior knot $x_2$, and the last one $x_{n-1}$, both are inactive (hence
the name). This again gives rise to two supplementary equations expressing
continuity of $s_3'''(f;x)$ at $x = x_2$ and $x = x_{n-1}$ (cf. Ex. 79).
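
As an illustration of how the system (2.145), (2.146) is completed and solved, the following sketch computes the slopes for the complete spline (b.1) and the natural spline (b.3). It is a hedged example rather than the book's algorithm: the function name `spline_slopes` and its parameters are hypothetical, indices are 0-based (so `m[0]` plays the role of $m_1$), and the tridiagonal system is solved by elimination without pivoting, which is assumed safe here because the matrix is diagonally dominant.

```python
def spline_slopes(x, f, end="natural", fpa=0.0, fpb=0.0):
    """Return the slopes m_1, ..., m_n (as a 0-based list) of a cubic spline interpolant.

    end="natural":  s''(a) = s''(b) = 0, i.e. (2.148) with f''(a) = f''(b) = 0
    end="complete": m_1 = fpa and m_n = fpb are prescribed endpoint derivatives
    """
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]          # mesh widths
    d = [(f[i + 1] - f[i]) / h[i] for i in range(n - 1)]  # first divided differences
    sub = [0.0] * n
    diag = [0.0] * n
    sup = [0.0] * n
    rhs = [0.0] * n
    # Interior equations (2.145) with right-hand sides (2.146).
    for i in range(1, n - 1):
        sub[i] = h[i]
        diag[i] = 2.0 * (h[i - 1] + h[i])
        sup[i] = h[i - 1]
        rhs[i] = 3.0 * (h[i] * d[i - 1] + h[i - 1] * d[i])
    if end == "complete":
        diag[0], rhs[0] = 1.0, fpa
        diag[-1], rhs[-1] = 1.0, fpb
    else:
        diag[0], sup[0], rhs[0] = 2.0, 1.0, 3.0 * d[0]
        sub[-1], diag[-1], rhs[-1] = 1.0, 2.0, 3.0 * d[-1]
    # Forward elimination for the tridiagonal system.
    for i in range(1, n):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    # Back substitution.
    m = [0.0] * n
    m[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        m[i] = (rhs[i] - sup[i] * m[i + 1]) / diag[i]
    return m
```

Two checks follow from uniqueness of the interpolants: the complete spline of a cubic returns the exact derivative values at the nodes, and the natural spline of a linear function returns its constant slope.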


2.3.5 Minimality Properties of Cubic Spline Interpolants 


The complete and natural splines defined in (b.1) and (b.3) of the preceding
section have interesting optimality properties. To formulate them, it is convenient
to consider not only the subdivision $\Delta$ in (2.122), but also the subdivision

$$\Delta':\quad a = x_0 = x_1 < x_2 < x_3 < \cdots < x_{n-1} < x_n = x_{n+1} = b, \qquad (2.149)$$

in which the endpoints are double knots. This means that whenever we interpolate
on $\Delta'$, we interpolate not only to function values at all interior points but also to the
function as well as first derivative values at the endpoints.

The first of the two theorems relates to the complete cubic spline interpolant,
$s_{\mathrm{compl}}(f;\cdot)$.


Theorem 2.3.1. For any function $g \in C^2[a,b]$ that interpolates $f$ on $\Delta'$, there
holds

$$\int_a^b [g''(x)]^2\,dx \ge \int_a^b [s''_{\mathrm{compl}}(f;x)]^2\,dx, \qquad (2.150)$$

with equality if and only if $g(\cdot) = s_{\mathrm{compl}}(f;\cdot)$.

Note that $s_{\mathrm{compl}}(f;\cdot)$ in Theorem 2.3.1 also interpolates $f$ on $\Delta'$, and among all
such interpolants its second derivative has the smallest $L_2$ norm.


Proof of Theorem 2.3.1. We write (for short) $s_{\mathrm{compl}} = s$. The theorem follows, once
we have shown that

$$\int_a^b [g''(x)]^2\,dx = \int_a^b [g''(x) - s''(x)]^2\,dx + \int_a^b [s''(x)]^2\,dx. \qquad (2.151)$$

Indeed, this immediately implies (2.150), and equality in (2.150) holds if and only if
$g''(x) - s''(x) \equiv 0$, which, integrating twice from $a$ to $x$ and using the interpolation
properties of $s$ and $g$ at $x = a$, gives $g(x) \equiv s(x)$.

To complete the proof, note that (2.151) is equivalent to

$$\int_a^b s''(x)[g''(x) - s''(x)]\,dx = 0. \qquad (2.152)$$
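
The equivalence of (2.151) and (2.152) is a one-line expansion of the square; a short sketch, using only the notation of the proof:

```latex
\int_a^b [g'']^2\,dx
  = \int_a^b \bigl[(g'' - s'') + s''\bigr]^2\,dx
  = \int_a^b [g'' - s'']^2\,dx
    + 2\int_a^b s''\,(g'' - s'')\,dx
    + \int_a^b [s'']^2\,dx .
```

Hence (2.151) holds exactly when the middle term vanishes, which is (2.152).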


Integrating by parts, we get

$$\begin{aligned}
\int_a^b s''(x)[g''(x) - s''(x)]\,dx
&= s''(x)[g'(x) - s'(x)]\Big|_a^b - \int_a^b s'''(x)[g'(x) - s'(x)]\,dx \\
&= -\int_a^b s'''(x)[g'(x) - s'(x)]\,dx,
\end{aligned} \qquad (2.153)$$

since $s'(b) = g'(b) = f'(b)$, and similarly at $x = a$. But $s'''$ is piecewise constant, so

$$\begin{aligned}
\int_a^b s'''(x)[g'(x) - s'(x)]\,dx
&= \sum_{\nu=1}^{n-1} s'''(x_\nu + 0) \int_{x_\nu}^{x_{\nu+1}} [g'(x) - s'(x)]\,dx \\
&= \sum_{\nu=1}^{n-1} s'''(x_\nu + 0)\,[g(x_{\nu+1}) - s(x_{\nu+1}) - (g(x_\nu) - s(x_\nu))] = 0,
\end{aligned}$$

since both $s$ and $g$ interpolate to $f$ on $\Delta$. This proves (2.152) and hence the
theorem. □

For interpolation on $\Delta$, the distinction of being optimal goes to the natural cubic
spline interpolant $s_{\mathrm{nat}}(f;\cdot)$. This is the content of the second theorem.

Theorem 2.3.2. For any function $g \in C^2[a,b]$ that interpolates $f$ on $\Delta$ (not $\Delta'$),
there holds

$$\int_a^b [g''(x)]^2\,dx \ge \int_a^b [s''_{\mathrm{nat}}(f;x)]^2\,dx, \qquad (2.154)$$

with equality if and only if $g(\cdot) = s_{\mathrm{nat}}(f;\cdot)$.

The proof of Theorem 2.3.2 is virtually the same as that of Theorem 2.3.1, since
(2.153) holds again, this time because $s''(b) = s''(a) = 0$. □

Putting $g(\cdot) = s_{\mathrm{compl}}(f;\cdot)$ in Theorem 2.3.2 immediately gives

$$\int_a^b [s''_{\mathrm{compl}}(f;x)]^2\,dx \ge \int_a^b [s''_{\mathrm{nat}}(f;x)]^2\,dx. \qquad (2.155)$$


Therefore, in a sense, the natural cubic spline is the “smoothest” interpolant. 

The property expressed in Theorem 2.3.2 is the origin of the name “spline.” 
A spline is a flexible strip of wood used in drawing curves. If its shape is given by 
the equation $y = g(x)$, $a \le x \le b$, and if the spline is constrained to pass through
the points $(x_i, g_i)$, then it assumes a form that minimizes the bending energy

$$\int_a^b \frac{[g''(x)]^2\,dx}{(1 + [g'(x)]^2)^3}$$

over all functions $g$ similarly constrained. For slowly varying $g$ ($\|g'\|_\infty \ll 1$), this
is nearly the same as the minimum property of Theorem 2.3.2.


2.4 Notes to Chapter 2 


There are many excellent texts on the general problem of best approximation as 
exemplified by (2.1). One that emphasizes uniform approximation by polynomials 
is Feinerman and Newman [1974]; apart from the basic theory of best polynomial 
approximation, it also contains no fewer than four proofs of the fundamental 
theorem of Weierstrass. For approximation in the $L_\infty$ and $L_1$ norm, which is
related to linear programming, a number of constructive methods, notably the 
Remez algorithms and exchange algorithms, are known, both for polynomial and 
rational approximation. Early, but still very readable, expositions are given in 
Cheney [1998] and Rivlin [1981], and more recent accounts in Watson [1980] 
and Powell [1981]. Nearly-best polynomial and rational approximations are widely
used in computer routines for special functions; for a survey of work in this
area, up to about 1975, see Gautschi [1975a], and for subsequent work, van der 
Laan and Temme [1984] and Németh [1992]. Much relevant material is also 
contained in the books by Luke [1975] and [1977]. The numerical approximation 
and software for special functions is the subject of Gil et al. [2007]; exhaustive 
documentation can also be found in Lozier and Olver [1994]. A package for some 
of the more esoteric functions is described in MacLeod [1996]. For an extensive 
(and mathematically demanding) treatment of rational approximation, the reader is 
referred to Petrushev and Popov [1987], and for $L_1$ approximation, to Pinkus [1989].
Methods of nonlinear approximation, including approximation by exponential sums, 
are studied in Braess [1986]. Other basic texts on approximation and interpolation 
are Natanson [1964, 1965, 1965] and Davis [1975] from the 1960s, and the more 
recent books by DeVore and Lorentz [1993] and its sequel, Lorentz et al. [1996]. A 
large variety of problems of interpolation and approximation by rational functions 
(including polynomials) in the complex plane is studied in Walsh [1969]. An 
example of a linear space $\Phi$ containing a denumerable set of nonrational basis
functions are the sinc functions, scaled translates of $\sin(\pi t)/(\pi t)$. They are of importance
in the Shannon sampling and interpolation theory (see, e.g., Zayed [1993]) and are 
also useful for approximation on infinite or semi-infinite domains in the complex 
plane; see Stenger [1993], [2000] and Kowalski et al. [1995] for an extensive 
discussion of this. A reader interested in issues of current interest related to 
multivariate approximation can get a good start by consulting Cheney [1986]. 

Rich and valuable sources on polynomials and their numerous properties 
of interest in applied analysis are Milovanović et al. [1994] and Borwein and
Erdélyi [1995]. Spline functions — in name and as a basic tool of approximation — 
were introduced in 1946 by Schoenberg [1946]; also see Schoenberg [1973]. They 
have generated enormous interest, owing both to their interesting mathematical 
theory and practical usefulness. There are now many texts available, treating 
splines from various points of view. A selected list is Ahlberg et al. [1967], 
Nürnberger [1989], and Schumaker [2007] for the basic theory, de Boor [2001] and
Späth [1995] for more practical aspects including algorithms, Atteia [1992] for an
abstract treatment based on Hilbert kernels, Bartels et al. [1987] and Dierckx [1993] 
for applications to computer graphics and geometric modeling, and Chui [1988], de 
Boor et al. [1993], and Bojanov et al. [1993] for multivariate splines. The standard 
text on trigonometric series still is Zygmund [2002].


Section 2.1. Historically, the least squares principle evolved in the context of 
discrete linear approximation. The principle was first enunciated by Legendre in 
1805 in a treatise on celestial mechanics (Legendre [1805]), although Gauss used it 
earlier in 1794, but published the method only in 1809 (in a paper also on celestial 
mechanics). For Gauss’s subsequent treatises, published in 1821-1826, see the 
English translation in Gauss [1995]. The statistical justification of least squares as 
a minimum variance (unbiased) estimator is due to Gauss. If one were to disregard 
probabilistic arguments, then, as Gauss already remarked (Goldstine [1977, p. 212]),
one could try to minimize the sum of any even (positive) power of the errors, and
even let this power go to infinity, in which case one would minimize the maximum
error. But by these principles “ ... we should be led into the most complicated
calculations.” Interestingly, Laplace at about the same time also proposed discrete
$L_1$ approximation (under the side condition that all errors add up to zero). A reader
interested in the history of least squares may wish to consult the article by 
Sheynin [1993]. 

The choice of weights $w_i$ in the discrete $L_2$ norm $\|\cdot\|_{2,w}$ can be motivated on
statistical grounds if one assumes that the errors in the data $f(x_i)$ are uncorrelated
and have zero mean and variances $\sigma_i^2$; an appropriate choice then is $w_i = \sigma_i^{-2}$.

The discrete problem of minimizing $\|f - \varphi\|_{2,w}$ over functions $\varphi$ in $\Phi$ as given
by (2.2) can be rephrased in terms of an overdetermined system of linear equations,
$Pc = f$, where $P = [\pi_j(x_i)]$ is a rectangular matrix of size $N \times n$, and
$f = [f(x_i)]$ the data vector of dimension $N$. If $r = f - Pc$, $r = [r_i]$, denotes the
residual vector, one tries to find the coefficient vector $c \in \mathbb{R}^n$ such that $\sum_i w_i r_i^2$ is
as small as possible. There is a vast literature dealing with overdetermined systems
involving more general (full or sparse) matrices and their solution by the method
of least squares. A large arsenal of modern techniques of matrix computation
can be brought to bear on this problem; see, for example, Björck [1996] for an
extensive discussion. In the special case considered here, the method of (discrete)
orthogonal polynomials, however, is more efficient. It has its origin in the work of
Chebyshev [1859]; a more contemporary exposition, including computational and
statistical issues, is given in Forsythe [1957].
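
The reformulation as an overdetermined system can be made concrete with a small sketch. It forms the weighted normal equations $(P^T W P)c = P^T W f$ with $w_i = \sigma_i^{-2}$ and solves them by Gaussian elimination. The function name `weighted_least_squares` and its arguments are hypothetical, and the normal equations are used only for brevity; as the text notes, orthogonal polynomials (or the matrix techniques in the cited literature) are the preferred tools in practice.

```python
def weighted_least_squares(basis, xs, fs, sigmas):
    """Minimize sum_i w_i * r_i**2, where r = f - P c and w_i = sigma_i**-2.

    basis is a list of callables pi_j; xs, fs, sigmas are sequences of equal length.
    """
    w = [1.0 / s**2 for s in sigmas]
    P = [[pi(x) for pi in basis] for x in xs]  # N x n matrix [pi_j(x_i)]
    N, n = len(xs), len(basis)
    # Normal equations A c = b with A = P^T W P and b = P^T W f.
    A = [[sum(w[i] * P[i][j] * P[i][k] for i in range(N)) for k in range(n)]
         for j in range(n)]
    b = [sum(w[i] * P[i][j] * fs[i] for i in range(N)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            t = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= t * A[col][k]
            b[r] -= t * b[col]
    # Back substitution.
    c = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = (b[j] - sum(A[j][k] * c[k] for k in range(j + 1, n))) / A[j][j]
    return c
```

For data lying exactly on a straight line, the fit recovers the line regardless of the weights, since the residual can be made zero.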

There are interesting variations on the theme of polynomial least squares
approximation. One is to minimize $\|f - p\|_{2,d\lambda}$ among all polynomials in $\mathbb{P}_n$
subject to interpolatory constraints at $m + 1$ given points, where $m < n$. It turns
out that this can be reduced to an unconstrained least squares problem, but for
a different measure $d\lambda$ and a different function $f$; cf. Gautschi [1996, Sect. 2.1].
Something similar is true for approximation by rational functions with a prescribed
denominator polynomial. A more substantial variation consists in wanting to
approximate simultaneously a function $f$ and its first $s$ derivatives. In the most general
setting, this would require the minimization of $\sum_{\sigma=0}^{s} \int_{\mathbb{R}} [f^{(\sigma)}(t) - p^{(\sigma)}(t)]^2\,d\lambda_\sigma(t)$
among all polynomials $p \in \mathbb{P}_n$, where $d\lambda_\sigma$ are given (continuous or discrete)
positive measures. The problem can be solved, as in Sect. 2.1.2, by orthogonal
polynomials, but they are now orthogonal with respect to the inner product
$(u,v)_{H_s} = \sum_{\sigma=0}^{s} \int_{\mathbb{R}} u^{(\sigma)}(t)\,v^{(\sigma)}(t)\,d\lambda_\sigma(t)$, a so-called Sobolev inner product. This
gives rise to Sobolev orthogonal polynomials; see Gautschi [2004, Sect. 1.7] for
some history on this problem and relevant literature.


Section 2.1.2. The alternative form (2.25) of computing the coefficients $\hat{c}_j$ was
suggested in the 1972 edition of Conte and de Boor [1980] and is further discussed
by Shampine [1975]. The Gram–Schmidt procedure described at the end of this
section is now called the classical Gram–Schmidt procedure. There are other,
modified, versions of Gram–Schmidt that are computationally more effective; see,
for example, Björck [1996, pp. 61 ff].
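
The difference between the classical and a modified Gram–Schmidt procedure can be illustrated on plain vectors with the Euclidean inner product. This is only a sketch under that assumption (the section itself works with an inner product on functions), and the names `dot` and `gram_schmidt` are hypothetical. The one change is which vector is projected: the classical variant always uses the original vector, the modified variant uses the running remainder.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, modified=True):
    """Orthonormalize a list of linearly independent vectors."""
    q = []
    for v in vectors:
        u = list(v)
        for qj in q:
            # Classical: coefficient from the original v.
            # Modified: coefficient from the partially orthogonalized u.
            coeff = dot(qj, u if modified else v)
            u = [ui - coeff * qi for ui, qi in zip(u, qj)]
        norm = dot(u, u) ** 0.5
        q.append([ui / norm for ui in u])
    return q
```

On nearly dependent vectors such as $(1, \varepsilon, 0, 0)$, $(1, 0, \varepsilon, 0)$, $(1, 0, 0, \varepsilon)$ with $\varepsilon = 10^{-8}$, the classical variant produces second and third vectors that are far from orthogonal in double precision, while the modified variant keeps them orthogonal to roughly rounding level.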

Section 2.1.4. The standard text on Fourier series, as already mentioned, is
Zygmund [2002], and on orthogonal polynomials, Szegő [1975]. Not only is it true
that orthogonal polynomials satisfy a three-term recurrence relation (2.38), but the
converse is also true: any system $\{\pi_k\}$ of monic polynomials satisfying (2.38) for all
$k \ge 0$, with real coefficients $\alpha_k$ and $\beta_k > 0$, is necessarily orthogonal with respect
to some (in general unknown) positive measure. This is known as Favard’s Theorem
(cf., e.g., Natanson [1965, Vol. 2, Chap. 8, Sect. 6]). The computation of orthogonal
polynomials, when the recursion coefficients are not known explicitly, is not an easy
task; a number of methods are surveyed in Gautschi [1996]; see also Gautschi [2004,
Chap. 2]. Orthogonal systems in $L_2(\mathbb{R})$ that have become prominent in recent
years are wavelets, which are functions of the form $\psi_{j,k}(t) = 2^{j/2}\psi(2^j t - k)$,
$j, k = 0, \pm 1, \pm 2, \ldots$, with $\psi$ a “mother wavelet”, square integrable on
$\mathbb{R}$ and (usually) satisfying $\int_{\mathbb{R}} \psi(t)\,dt = 0$. Among the growing textbook and
monograph literature on this subject, we mention Chui [1992], Daubechies [1992], 
Walter [1994], Wickerhauser [1994], Hernandez and Weiss [1996], Resnikoff and 
Wells [1998], Burrus et al. [1998], and Novikov et al. [2010]. 


Section 2.2. Although interpolation by polynomials and spline functions is most 
common, it is sometimes appropriate to use other systems of approximants for in- 
terpolation, for example, trigonometric polynomials or rational functions. Trigono- 
metric interpolation at equally spaced points is closely related to discrete Fourier 
analysis and hence accessible to the Fast Fourier Transform (FFT). For this, and 
also for rational interpolation algorithms, see, for example, Stoer and Bulirsch
[2002, Sects. 2.1.1 and 2.2]. For the fast Fourier transform and some of its important
applications, see Henrici [1979a] and Van Loan [1992]. 

Besides Lagrange and Hermite interpolation, other types of interpolation 
processes have been studied in the literature. Among these are Fejér—Hermite 
interpolation, where one interpolates to given function values and requires 
the derivative to vanish at these points, and Birkhoff (also called lacunary) 
interpolation, which is similar to Hermite interpolation, but derivatives of only 
preselected orders are being interpolated. Remarkably, Fejér—Hermite interpolation 
at the Chebyshev points (defined in Sect. 2.2.4) converges for every continuous
function $f \in C[-1,1]$. The convergence theory of Lagrange and Fejér–Hermite
interpolation is the subject of a monograph by Szabados and Vértesi [1990]. 
The most comprehensive work on Birkhoff interpolation is the book by G.G. 
Lorentz et al. [1983]. A more recent monograph by R. A. Lorentz [1992] deals with 
multivariate Birkhoff interpolation. 


Section 2.2.1. The growth of the Lebesgue constants $\Lambda_n$ is at least $O(\log n)$ as
$n \to \infty$; specifically, $\Lambda_n > \frac{2}{\pi}\log n + c$ for any triangular array of interpolation
nodes (cf. Sect. 2.1.4), where the constant $c$ can be expressed in terms of Euler’s
constant $\gamma$ (cf. Chap. 1, MA 4) by $c = \frac{2}{\pi}\left(\log\frac{8}{\pi} + \gamma\right) = 0.9625228\ldots$; see
Rivlin [1990, Theorem 1.2]. The Chebyshev points achieve the optimal order
$O(\log n)$; for them, $\Lambda_n \le \frac{2}{\pi}\log n + 1$ (Rivlin [1990, Theorem 1.2]). Equally
spaced nodes, on the other hand, lead to exponential growth of the Lebesgue
constants inasmuch as $\Lambda_n \sim 2^{n+1}/(e\,n\log n)$ for $n \to \infty$; see Trefethen and
Weideman [1991] for some history on this result and Brutman [1997a] for a recent
survey on Lebesgue constants. The very last statement of Sect. 2.1.2 is the content
of Faber’s Theorem (see, e.g., Natanson [1965, Vol. 3, Chap. 2, Theorem 2]), which
says that, no matter how one chooses the triangular array of nodes (2.64) in
$[a,b]$, there is always a continuous function $f \in C[a,b]$ for which the Lagrange
interpolation process does not converge uniformly to $f$. Indeed, there is an $f \in
C[a,b]$ for which Lagrange interpolation diverges almost everywhere in $[a,b]$; see
Erdős and Vértesi [1980]. Compare this with Fejér–Hermite interpolation.
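
The contrast between logarithmic and exponential growth is easy to observe numerically. The sketch below estimates the Lebesgue constant as the maximum of $\sum_i |\ell_i(x)|$ over a uniform sample of $[-1,1]$, so it is a lower estimate rather than an exact value. The function names are hypothetical, and the Chebyshev points are assumed here to be the zeros of $T_{n+1}$.

```python
import math


def lebesgue_constant(nodes, samples=2001):
    """Estimate the maximum over [-1, 1] of sum_i |l_i(x)| on a uniform grid."""
    best = 0.0
    for s in range(samples):
        x = -1.0 + 2.0 * s / (samples - 1)
        total = 0.0
        for i, xi in enumerate(nodes):
            li = 1.0  # value of the i-th elementary Lagrange polynomial at x
            for j, xj in enumerate(nodes):
                if j != i:
                    li *= (x - xj) / (xi - xj)
            total += abs(li)
        best = max(best, total)
    return best


def chebyshev_nodes(n):
    """The n + 1 zeros of the Chebyshev polynomial T_{n+1} (assumed definition)."""
    return [math.cos((2 * k + 1) * math.pi / (2 * n + 2)) for k in range(n + 1)]


def equispaced_nodes(n):
    """The n + 1 equally spaced points on [-1, 1]."""
    return [-1.0 + 2.0 * k / n for k in range(n + 1)]
```

For $n = 10$ the Chebyshev estimate stays below 3, while the equally spaced estimate is already larger by an order of magnitude.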


Section 2.2.3. A more complete discussion of how the convergence domain of 
Lagrange interpolation in the complex plane depends on the limit distribution of 
the interpolation nodes can be found in Krylov [1962, Chap. 12, Sect. 2]. 

Runge’s example is further elucidated in Epperson [1987]. For an analysis 
of Bernstein’s example, we refer to Natanson [1965, Vol. 3, Chap. 2, Sect. 2]. The 
same divergence phenomenon, incidentally, is exhibited also for a large class of 
nonequally spaced nodes; see Brutman and Passow [1995]. The proof of Example 5 
follows Fejér [1918]. 


Section 2.2.4. The Chebyshev polynomial arguably is one of the most interesting 
polynomials from the point of view not only of approximation theory, but also of 
algebra and number theory. In Rivlin’s words, it “ ... is like a fine jewel that 
reveals different characteristics under illumination from various positions.” In his 
text, Rivlin [1990] gives ample testimony in support of this view. Another text, 
unfortunately available only in Russian (or Polish), is Paszkowski [1983], which 
has an exhaustive account of analytic properties of Chebyshev polynomials as well 
as numerical applications. 

The convergence result stated in (2.97) follows from (2.59) and the logarithmic
growth of $\Lambda_n$, since $E_n(f)\log n \to 0$ for $f \in C^1[-1,1]$ by Jackson’s theorems (cf.
Cheney [1998, p. 147]). A more rigorous estimate for the error in (2.102) is $E_n(f) \le
\|\tau_n - f\|_\infty \le \left(4 + \frac{4}{\pi^2}\log n\right) E_n(f)$ (Rivlin [1990, Theorem 3.3]), where the infinity
norm refers to the interval $[-1,1]$ and $E_n(f)$ is the error of the best uniform approximation of
$f$ on $[-1,1]$ by polynomials of degree $n$.


Section 2.2.5. A precursor of the algorithm (2.108) expressing Ae in the form of 
a sum rather than a product, and thus susceptible to serious cancellation errors, was 
proposed in Werner [1984]. The more stable algorithm given in the text is due to 
Berrut and Trefethen [2004]. Barycentric formulae have been developed also for 
trigonometric interpolation (see Henrici [1979b] for uniform, and Salzer [1949] and 
Berrut [1984] for nonuniform distributions of the nodes), and for cardinal (sinc-) 
interpolation (Berrut [1989]); for the latter, see also Gautschi [2001] and Chap. 1, 
MA 10. 


Section 2.2.7. There are explicit formulae, analogous to Lagrange’s formula, 
for Hermite interpolation in the most general case; see, for example, Stoer and 
Bulirsch [2002, Sect. 2.1.5]. For the important special case $m_\nu = 2$, see also
Chap. 3, Ex. 34(a). 


Section 2.2.8. To estimate the error of inverse interpolation, using an appropriate 
version of (2.60), one needs the derivatives of the inverse function $f^{-1}$. A general
expression for the $n$th derivative of $f^{-1}$ in terms of the first $n$ derivatives of $f$ is
derived in Ostrowski [1973, Appendix C].


Section 2.3. The definition of the class of spline functions $S_m^k(\Delta)$ can be refined to
$S_m^{\mathbf{k}}(\Delta)$, where $\mathbf{k}^T = [k_2, k_3, \ldots, k_{n-1}]$ is a vector with integer components $k_i \ge -1$
specifying the degree of smoothness at the interior knots $x_i$; that is, $s^{(j)}(x_i + 0) -
s^{(j)}(x_i - 0) = 0$ for $j = 0, 1, \ldots, k_i$. Then $S_m^k(\Delta)$ as defined in (2.124) becomes
$S_m^{\mathbf{k}}(\Delta)$ with $\mathbf{k} = [k, k, \ldots, k]$.


Section 2.3.1. As simple as the procedure of piecewise linear interpolation may
seem, it can be applied to advantage in numerical Fourier analysis, for example. In
trying to compute the (complex) Fourier coefficients $c_n(f) = \frac{1}{2\pi}\int_0^{2\pi} f(x)e^{-inx}\,dx$
of a $2\pi$-periodic function $f$, one often approximates them by the “discrete Fourier
transform” $\hat{c}_n(f) = \frac{1}{N}\sum_{k=0}^{N-1} f(x_k)e^{-inx_k}$, where $x_k = k\,\frac{2\pi}{N}$. This can be computed
efficiently (for large $N$) by the Fast Fourier Transform. Note, however, that $\hat{c}_n(f)$
is periodic in $n$ with period $N$, whereas the true Fourier coefficients $c_n(f)$ tend
to zero as $n \to \infty$. To remove this deficiency, one can approximate $f$ by some
(simple) function $\varphi$ and thereby approximate $c_n(f)$ by $c_n(\varphi)$. Then $c_n(\varphi)$ will
indeed tend to zero as $n \to \infty$. The simplest choice for $\varphi$ is precisely the piecewise
linear interpolant $\varphi = s_1(f;\cdot)$ (relative to the uniform partition of $[0,2\pi]$ into $N$
subintervals). One then finds, rather remarkably (see Chap. 3, Ex. 14), that $c_n(\varphi)$ is
a multiple of the discrete Fourier transform, namely, $c_n(\varphi) = \tau_n \hat{c}_n(f)$, where
$\tau_n = \left(\frac{\sin(n\pi/N)}{n\pi/N}\right)^2$; this still allows the application of the FFT but corrects the behavior
of $\hat{c}_n(f)$ at infinity. The same modification of the discrete Fourier transform by an
“attenuation factor” $\tau_n$ occurs for many other approximation processes $f \approx \varphi$; see
Gautschi [1971/1972] for a general theory (and history) of attenuation factors.
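
The attenuation-factor identity for the piecewise linear interpolant can be checked numerically. The sketch below is an illustration under stated assumptions: the sign convention $e^{-inx}$ is assumed for the Fourier coefficients, all function names are hypothetical, the discrete coefficient is computed by direct summation rather than by an FFT, and the Fourier coefficient of the interpolant is obtained by Simpson's rule on each linear piece.

```python
import cmath
import math


def dft_coefficient(samples, n):
    """(1/N) * sum_k f(x_k) * exp(-i n x_k) with x_k = 2 pi k / N."""
    N = len(samples)
    return sum(fk * cmath.exp(-1j * n * 2.0 * math.pi * k / N)
               for k, fk in enumerate(samples)) / N


def attenuation_factor(n, N):
    """tau_n = (sin(n pi / N) / (n pi / N))**2, with tau_0 = 1."""
    if n == 0:
        return 1.0
    t = n * math.pi / N
    return (math.sin(t) / t) ** 2


def fourier_coefficient_of_interpolant(samples, n, panels=400):
    """c_n of the 2 pi-periodic piecewise linear interpolant, by Simpson's rule.

    panels is the (even) number of Simpson subintervals used on each piece.
    """
    N = len(samples)
    h = 2.0 * math.pi / N
    step = h / panels
    total = 0.0
    for k in range(N):
        left, right = samples[k], samples[(k + 1) % N]  # periodic wrap-around
        for p in range(panels + 1):
            x = k * h + p * step
            phi = left + (right - left) * p / panels      # linear piece at x
            weight = 1 if p in (0, panels) else (4 if p % 2 else 2)
            total += weight * phi * cmath.exp(-1j * n * x) * step / 3.0
    return total / (2.0 * math.pi)
```

For any sample vector, `fourier_coefficient_of_interpolant(samples, n)` should agree with `attenuation_factor(n, N) * dft_coefficient(samples, n)` up to quadrature error, while `dft_coefficient` alone repeats with period `N` in `n`.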

The near optimality of the piecewise linear interpolant $s_1(f;\cdot)$, as expressed by
the inequalities in (2.128), is noted by de Boor [2001, p. 31]. 


Section 2.3.2. The basis (2.129) for $S_1^0(\Delta)$ is a special case of a B-spline basis
that can be defined for any space of spline functions $S_m^{\mathbf{k}}(\Delta)$ previously introduced
(cf. de Boor [2001, Theorem IX(44)]). The B-splines are formed by means of divided
differences of order $m + 1$ applied to the truncated power $(t - x)_+^m$ (considered as
a function of $t$). Like the basis in (2.129), each basis function of a B-spline basis is
supported on at most $m + 1$ consecutive intervals of $\Delta$ and is positive on the interior
of the support.


Section 2.3.3. A proof of the near optimality of the piecewise linear least squares
approximant $\hat{s}_1(f;\cdot)$, as expressed by the inequalities (2.137), can be found in
de Boor [2001, p. 32]. For smoothing and least squares approximation procedures 
involving cubic splines, see, for example, de Boor [2001, Chap. XIV]. 


Section 2.3.4. (a) For the remark in the last paragraph of (a), see de Boor [2001, 
Chap. 4, Problem 3]. 

(b.1) The error bounds in (2.147), which for $r = 0$ and $r = 1$ are asymptotically
sharp, are due to Hall and Meyer [1976]. 

(b.2) The cubic spline interpolant matching second derivatives at the endpoints
satisfies the same error bounds as in (2.147) for $r = 0, 1, 2$, with constants $c_0 = \frac{3}{64}$,
$c_1 = \frac{3}{16}$, and $c_2 = \frac{3}{8}$; see Kershaw [1971, Theorem 2]. The same is shown also for
periodic spline interpolants $s$, satisfying $s^{(r)}(a) = s^{(r)}(b)$ for $r = 0, 1, 2$.

(b.3) Even though the natural spline interpolant, in general, converges only with
order $|\Delta|^2$ (e.g., for uniform partitions $\Delta$), it has been shown by Atkinson [1968]
that the order of convergence is $|\Delta|^4$ on any compact interval contained in the open
interval $(a,b)$, and by Kershaw [1971] even on intervals extending (in a sense
made precise) to $[a,b]$ as $|\Delta| \to 0$. On such intervals, in fact, the natural spline
interpolant $s$ provides approximations to any $f \in C^4[a,b]$ with errors satisfying
$\|f^{(r)} - s^{(r)}\|_\infty \le 8 c_r K |\Delta|^{4-r}$, where $K = 2 + \frac{3}{8}\|f^{(4)}\|_\infty$ and $c_0 = \frac{1}{8}$, $c_1 = \frac{1}{2}$,
and $c_2 = 1$.

(b.4) The error of the “not-a-knot” spline interpolant is of the same order as
the error of the complete spline; it follows from Beatson [1986, (2.49)] that for
functions $f \in C^4[a,b]$, one has $\|f^{(r)} - s^{(r)}\|_\infty \le c_r |\Delta|^{4-r} \|f^{(4)}\|_\infty$, $r = 0$,
1, 2 (at least when $n > 6$), where $c_r$ are constants independent of $f$ and $\Delta$. The
same bounds are valid for other schemes that depend only on function values,
for example, the scheme with $m_1$ equal to the first (or second) derivative of
$p_3(f; x_1, x_2, x_3, x_4; \cdot)$ at $x = a$, and similarly for $m_n$. The first of these schemes
(using first-order derivatives of $p_3$) is in fact the one recommended by Beatson and
Chacko [1989, 1992] for general-purpose interpolation. Numerical experiments in
Beatson and Chacko [1989] suggest values of approximately 1 for the constants $c_r$
in the preceding error estimates. In Beatson and Chacko [1992] further comparisons
are made among many other cubic spline interpolation schemes.


Section 2.3.5. The minimum norm property of natural splines (Theorem 2.3.1) 
and its proof based on the identity (2.151), called “the first integral relation” in 
Ahlberg et al. [1967], is due to Holladay [1957], who derived it in the context of 
numerical quadrature. “Much of the present-day theory of splines began with this 
theorem” (Ahlberg et al. [1967, p. 3]). An elegant alternative proof of (2.152), and 
hence of the theorem, can be based (cf. de Boor [2001, pp. 64–66]) on the Peano
representation (see Chap. 3, Sect. 3.2.6) of the second divided difference of $g - s$,
that is, $[x_{i-1}, x_i, x_{i+1}](g - s) = \int_{\mathbb{R}} K(t)\,(g''(t) - s''(t))\,dt$, by noting that the Peano
kernel $K$, up to a constant, is the B-spline $B_i$ defined in (2.129). Since the left-hand
side is zero by the interpolation properties of $g$ and $s$, it follows from the preceding
equation that $g'' - s''$ is orthogonal to the span of the $B_i$, hence to $s''$, which lies in
this span.


Exercises and Machine Assignments to Chapter 2 
Exercises 
1. Suppose you want to approximate the function

$$f(t) = \begin{cases} -1 & \text{if } -1 \le t < 0, \\ 0 & \text{if } t = 0, \\ 1 & \text{if } 0 < t \le 1 \end{cases}$$

by a constant function $\varphi(t) \equiv c$:

(a) on $[-1,1]$ in the continuous $L_1$ norm,

(b) on $\{t_1, t_2, \ldots, t_N\}$ in the discrete $L_1$ norm,

(c) on $[-1,1]$ in the continuous $L_2$ norm,

(d) on $\{t_1, t_2, \ldots, t_N\}$ in the discrete $L_2$ norm,

(e) on $[-1,1]$ in the $\infty$-norm,

(f) on $\{t_1, t_2, \ldots, t_N\}$ in the discrete $\infty$-norm.

The weighting in all norms is uniform (i.e., $w(t) \equiv 1$, $w_i = 1$) and $t_i =
-1 + \frac{2(i-1)}{N-1}$, $i = 1, 2, \ldots, N$. Determine the best constant $c$ (or constants $c$, if
there is nonuniqueness) and the minimum error.
2. Consider the data

$$f(t_i) = 1, \quad i = 1, 2, \ldots, N-1; \qquad f(t_N) = y \gg 1.$$

(a) Determine the discrete $L_\infty$ approximant to $f$ by means of a constant $c$
(polynomial of degree zero).

(b) Do the same for discrete (equally weighted) least squares approximation.

(c) Compare and discuss the results, especially as $N \to \infty$.


3. Let $x_0, x_1, \ldots, x_n$ be pairwise distinct points in $[a,b]$, $-\infty < a < b < \infty$, and
$f \in C^1[a,b]$. Show that, given any $\varepsilon > 0$, there exists a polynomial $p$ such that
$\|f - p\|_\infty < \varepsilon$ and, at the same time, $p(x_i) = f(x_i)$, $i = 0, 1, \ldots, n$. Here
$\|u\|_\infty = \max_{a \le x \le b} |u(x)|$. {Hint: write $p = p_n(f;\cdot) + \omega_n q$, where $p_n(f;\cdot)$
is the interpolation polynomial of degree $n$ (cf. Sect. 2.2.1, (2.51)), $\omega_n(x) =
\prod_{i=0}^{n}(x - x_i)$, $q \in \mathbb{P}$, and apply Weierstrass’s approximation theorem.}

4. Consider the function $f(t) = t^\alpha$ on $0 \le t \le 1$, where $\alpha > 0$. Suppose we
want to approximate $f$ best in the $L_p$ norm by a constant $c$, $0 < c < 1$, that is,
minimize the $L_p$ error

$$E_p(c) = \|t^\alpha - c\|_p = \left(\int_0^1 |t^\alpha - c|^p\,dt\right)^{1/p}$$

as a function of $c$. Find the optimal $c = c_p$ for $p = \infty$, $p = 2$, and $p = 1$,
and determine $E_p(c_p)$ for each of these $p$-values.

5. Taylor expansion yields the simple approximation $e^x \approx 1 + x$, $0 \le x \le 1$.
Suppose you want to improve this by seeking an approximation of the form
$e^x \approx 1 + cx$, $0 \le x \le 1$, for some suitable $c$.

(a) How must $c$ be chosen if the approximation is to be optimal in the
(continuous, equally weighted) least squares sense?

(b) Sketch the error curves $e_1(x) := e^x - (1 + x)$ and $e_2(x) := e^x - (1 + cx)$ with
$c$ as obtained in (a) and determine $\max_{0 \le x \le 1} |e_1(x)|$ and $\max_{0 \le x \le 1} |e_2(x)|$.

(c) Solve the analogous problem with three instead of two terms in the
modified Taylor expansion: $e^x \approx 1 + c_1 x + c_2 x^2$, and provide error curves
for $e_1(x) = e^x - 1 - x - \frac{1}{2}x^2$ and $e_2(x) = e^x - 1 - c_1 x - c_2 x^2$.

6. Prove Schwarz’s inequality

$$|(u, v)| \le \|u\| \cdot \|v\|$$

for the inner product (2.10). {Hint: use the nonnegativity of $\|u + tv\|^2$, $t \in \mathbb{R}$.}


7. Discuss uniqueness and nonuniqueness of the least squares approximant to a
function $f$ in the case of a discrete set $T = \{t_1, t_2\}$ (i.e., $N = 2$) and $\Phi_n =
\mathbb{P}_{n-1}$ (polynomials of degree $\le n - 1$). In case of nonuniqueness, determine all
solutions.


8. Determine the least squares approximation

$$\varphi(t) = \frac{c_1}{1+t} + \frac{c_2}{(1+t)^2}, \quad 0 \le t \le 1,$$

to the exponential function $f(t) = e^{-t}$, assuming $d\lambda(t) = dt$ on $[0,1]$.
Determine the condition number $\mathrm{cond}_\infty A = \|A\|_\infty \|A^{-1}\|_\infty$ of the coefficient
matrix $A$ of the normal equations. Calculate the error $f(t) - \varphi(t)$ at $t = 0$, $t =
1/2$, and $t = 1$. {Point of information: the integral $\int_1^\infty t^{-m} e^{-xt}\,dt = E_m(x)$
is known as the “$m$th exponential integral”; cf. Abramowitz and Stegun [1964,
(5.1.4)] or Olver et al. [2010, (8.19.3)].}


9. Approximate the circular quarter arc $\gamma$ given by the equation $y(t) = \sqrt{1 - t^2}$,
$0 \le t \le 1$ (see figure) by a straight line $\ell$ in the least squares sense, using either
the weight function $w(t) = (1 - t^2)^{-1/2}$, $0 \le t \le 1$, or $w(t) = 1$, $0 \le t \le 1$.
Where does $\ell$ intersect the coordinate axes in these two cases?

{Points of information: $\int_0^{\pi/2} \cos^2\theta\,d\theta = \frac{\pi}{4}$, $\int_0^{\pi/2} \cos^4\theta\,d\theta = \frac{3\pi}{16}$.}


10. (a) Let the class $\Phi_n$ of approximating functions have the following properties.
Each $\varphi \in \Phi_n$ is defined on an interval $[a,b]$ symmetric with respect to the
origin (i.e., $a = -b$), and $\varphi(t) \in \Phi_n$ implies $\varphi(-t) \in \Phi_n$. Let $d\lambda(t) =
w(t)\,dt$, with $w(t)$ an even function on $[a,b]$ (i.e., $w(-t) = w(t)$). Show: if
$f$ is an even function on $[a,b]$, then so is its least squares approximant, $\hat{\varphi}_n$,
on $[a,b]$ from $\Phi_n$.

(b) Consider the “hat function”

$$f(t) = \begin{cases} 1 - t & \text{if } 0 \le t \le 1, \\ 1 + t & \text{if } -1 \le t \le 0. \end{cases}$$

Determine its least squares approximation on $[-1,1]$ by a polynomial of
degree $\le 2$. (Use $d\lambda(t) = dt$.) Simplify your calculation by using part (a).
Determine where the error vanishes.


11. Suppose you want to approximate the step function

$$f(t) = \begin{cases} 1 & \text{if } 0 \le t \le 1, \\ 0 & \text{if } t > 1 \end{cases}$$

on the positive line $\mathbb{R}_+$ by a linear combination of exponentials $\pi_j(t) =
e^{-jt}$, $j = 1, 2, \ldots, n$, in the (continuous, equally weighted) least squares
sense.

(a) Derive the normal equations. How is the matrix related to the Hilbert
matrix?

(b) Use Matlab to solve the normal equations for $n = 1, 2, \ldots, 8$. Print $n$, the
Euclidean condition number of the matrix (supplied by the Matlab function
cond.m), along with the solution. Plot the approximations vs. the exact
function for $1 \le n \le 4$.


12. Let $\pi_j(t) = (t - a_j)^{-1}$, $j = 1, 2, \ldots, n$, where $a_j$ are distinct real numbers
with $|a_j| > 1$, $j = 1, 2, \ldots, n$. For $d\lambda(t) = dt$ on $-1 \le t \le 1$ and $d\lambda(t) = 0$,
$t \notin [-1,1]$, determine the matrix of the normal equations for the least squares
problem $\int_{\mathbb{R}} (f - \varphi)^2\,d\lambda(t) = \min$, $\varphi = \sum_{j=1}^{n} c_j \pi_j$. Can the system $\{\pi_j\}_{j=1}^{n}$,
$n > 1$, be an orthogonal system for suitable choices of the constants $a_j$?
Explain.


13. Given an integer $n \ge 1$, consider the subdivision $\Delta_n$ of the interval $[0, 1]$ into
$n$ equal subintervals of length $1/n$. Let $\pi_j(t)$, $j = 0, 1, \ldots, n$, be the function
having the value 1 at $t = j/n$, decreasing on either side linearly to zero at the
neighboring subdivision points (if any), and being zero elsewhere.

(a) Draw a picture of these functions. Describe in words the meaning of a linear
combination $\pi(t) = \sum_{j=0}^{n} c_j \pi_j(t)$.

(b) Determine $\pi_j(k/n)$ for $j, k = 0, 1, \ldots, n$.

(c) Show that the system $\{\pi_j(t)\}_{j=0}^{n}$ is linearly independent on the interval
$0 \le t \le 1$. Is it also linearly independent on the set of subdivision points $0$,
$\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1$ of $\Delta_n$? Explain.

(d) Compute the matrix of the normal equations for $\{\pi_j\}$, assuming $d\lambda(t) = dt$
on $[0,1]$. That is, compute the $(n + 1) \times (n + 1)$ matrix $A = [a_{ij}]$, where

$$a_{ij} = \int_0^1 \pi_i(t)\,\pi_j(t)\,dt.$$

14. Even though the function $f(t) = \ln(1/t)$ becomes infinite as $t \to 0$, it can
be approximated on $[0,1]$ arbitrarily well by polynomials of sufficiently high
degree in the (continuous, equally weighted) least squares sense. Show this by
proving

$$e_{n,2} := \min_{p \in \mathbb{P}_n} \|f - p\|_2 = \frac{1}{n+1}.$$

{Hint: use the following known facts about the "shifted" Legendre polynomial
$\pi_j(t)$ of degree $j$ (orthogonal on $[0,1]$ with respect to the weight function $w \equiv
1$ and normalized to satisfy $\pi_j(1) = 1$):

$$\int_0^1 \pi_j^2(t)\,dt = \frac{1}{2j+1}, \quad j \ge 0; \qquad
\int_0^1 \pi_j(t) \ln(1/t)\,dt = \begin{cases} 1 & \text{if } j = 0, \\[2pt] \dfrac{(-1)^j}{j(j+1)} & \text{if } j > 0. \end{cases}$$

The first relation is well known from the theory of orthogonal polynomials (see,
e.g., Sect. 1.5.1, p. 27 of Gautschi [2004]); the second is due to Blue [1979].}
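
The two relations in the hint can be combined numerically before attempting the proof. The following is only a sanity check, written in Python rather than the Matlab used by the text; the function name `ls_error_squared` is made up for this illustration, and the value $\int_0^1 \ln^2(1/t)\,dt = 2$ is a standard integral that the exercise does not state. The squared error is $\|f\|^2$ minus the sum of $(f,\pi_j)^2/\|\pi_j\|^2$ over $j = 0, \ldots, n$.

```python
from fractions import Fraction


def ls_error_squared(n):
    """Squared L2 error of the best degree-n polynomial fit to ln(1/t) on [0, 1].

    Uses the two relations quoted in the hint; exact rational arithmetic.
    """
    norm_f_sq = Fraction(2)      # assumed: integral of ln(1/t)^2 over [0, 1] equals 2
    captured = Fraction(1)       # j = 0 term: (f, pi_0)^2 / ||pi_0||^2 = 1^2 / 1
    for j in range(1, n + 1):
        inner = Fraction((-1) ** j, j * (j + 1))   # (f, pi_j) from the hint
        norm_sq = Fraction(1, 2 * j + 1)           # ||pi_j||^2 from the hint
        captured += inner * inner / norm_sq
    return norm_f_sq - captured
```

For small $n$ the returned value agrees with $1/(n+1)^2$, which is what the exercise asks to prove in general.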


15. Let $d\lambda$ be a continuous (positive) measure on $[a,b]$ and $n \ge 1$ a given integer.
Assume $f$ continuous on $[a,b]$ and not a polynomial of degree $\le n - 1$. Let
$\hat p_{n-1} \in \mathbb{P}_{n-1}$ be the least squares approximant to $f$ on $[a, b]$ from polynomials
of degree $\le n - 1$:

$$\int_a^b [\hat p_{n-1}(t) - f(t)]^2\, d\lambda(t) \le \int_a^b [p(t) - f(t)]^2\, d\lambda(t), \quad \text{all } p \in \mathbb{P}_{n-1}.$$

Prove: the error $e_n(t) = \hat p_{n-1}(t) - f(t)$ changes sign at least $n$ times in $[a, b]$.
{Hint: assume the contrary and develop a contradiction.}


16. Let $f$ be a given function on $[0,1]$ satisfying $f(0) = 0$, $f(1) = 1$.

(a) Reduce the problem of approximating $f$ on $[0,1]$ in the (continuous,
equally weighted) least squares sense by a quadratic polynomial $p$ satisfying
$p(0) = 0$, $p(1) = 1$ to an unconstrained least squares problem (for
a different function).

(b) Apply the result of (a) to $f(t) = t^r$, $r > 2$. Plot the approximation against
the exact function for $r = 3$.


17. Suppose you want to approximate $f(t)$ on $[a,b]$ by a function of the form
$r(t) = \pi(t)/q(t)$ in the least squares sense with weight function $w$, where
$\pi \in \mathbb{P}_n$ and $q$ is a given function (e.g., a polynomial) such that $q(t) > 0$ on
$[a, b]$. Formulate this problem as an ordinary polynomial least squares problem
for an appropriate new function $\bar f$ and new weight function $\bar w$.


18. The Bernstein polynomials of degree $n$ are defined by

$$B_j^n(t) = \binom{n}{j} t^j (1 - t)^{n-j}, \quad j = 0, 1, \ldots, n,$$

and are usually employed on the interval $0 \le t \le 1$.

(a) Show that $B_0^n(0) = 1$, and for $j = 1, 2, \ldots, n$

$$\left.\frac{d^r}{dt^r} B_j^n(t)\right|_{t=0} = 0, \quad r = 0, 1, \ldots, j-1; \qquad \left.\frac{d^j}{dt^j} B_j^n(t)\right|_{t=0} \ne 0.$$

(b) What are the analogous properties at $t = 1$, and how are they most easily
derived?

(c) Prepare a plot of the fourth-degree polynomials $B_j^4(t)$, $j = 0, 1, \ldots, 4$,
$0 \le t \le 1$.

(d) Use (a) to show that the system $\{B_j^n(t)\}_{j=0}^{n}$ is linearly independent on $[0,1]$
and spans the space $\mathbb{P}_n$.

(e) Show that $\sum_{j=0}^{n} B_j^n(t) \equiv 1$. {Hint: use the binomial theorem.}
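
The definition of the Bernstein polynomials translates directly into code, which gives a quick way to spot-check the values at $t = 0$ in part (a) and the identity in part (e) before proving them. This is a small Python sketch (the text itself works in Matlab); the helper names `bernstein` and `bernstein_basis` are illustrative only.

```python
from math import comb


def bernstein(n, j, t):
    """Value of B_j^n(t) = C(n, j) * t^j * (1 - t)^(n - j)."""
    return comb(n, j) * t ** j * (1 - t) ** (n - j)


def bernstein_basis(n, t):
    """All n + 1 Bernstein polynomials of degree n evaluated at t."""
    return [bernstein(n, j, t) for j in range(n + 1)]
```

Summing `bernstein_basis(4, t)` for a few values of `t` in $[0,1]$ gives 1 up to rounding, consistent with part (e).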


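19. Prove that, if $\{\pi_j\}_{j=1}^{n}$ is linearly dependent on the support of $d\lambda$, then the
matrix $A = [a_{ij}]$, where $a_{ij} = (\pi_i, \pi_j)_{d\lambda} = \int_{\mathbb{R}} \pi_i(t)\,\pi_j(t)\,d\lambda(t)$, is singular.

20. Given the recursion relation $\pi_{k+1}(t) = (t - \alpha_k)\pi_k(t) - \beta_k \pi_{k-1}(t)$, $k =
0, 1, 2, \ldots$, for the (monic) orthogonal polynomials $\{\pi_j(\cdot\,; d\lambda)\}$, and defining
$\beta_0 = \int_{\mathbb{R}} d\lambda(t)$, show that $\|\pi_k\|^2 = \beta_0 \beta_1 \cdots \beta_k$, $k = 0, 1, 2, \ldots$.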

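21. (a) Derive the three-term recurrence relation

$$\sqrt{\beta_{k+1}}\,\tilde\pi_{k+1}(t) = (t - \alpha_k)\,\tilde\pi_k(t) - \sqrt{\beta_k}\,\tilde\pi_{k-1}(t), \quad k = 0, 1, 2, \ldots,$$
$$\tilde\pi_{-1}(t) = 0, \quad \tilde\pi_0 = 1/\sqrt{\beta_0}$$

for the orthonormal polynomials $\tilde\pi_k = \pi_k / \|\pi_k\|$, $k = 0, 1, 2, \ldots$.
(b) Use the result of (a) to derive the Christoffel-Darboux formula

$$\sum_{k=0}^{n} \tilde\pi_k(x)\,\tilde\pi_k(t) = \sqrt{\beta_{n+1}}\; \frac{\tilde\pi_{n+1}(x)\,\tilde\pi_n(t) - \tilde\pi_n(x)\,\tilde\pi_{n+1}(t)}{x - t}.$$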


22. (a) Let $\pi_n(\cdot) = \pi_n(\cdot\,; d\lambda)$ be the (monic) orthogonal polynomial of degree $n$
relative to the positive measure $d\lambda$ on $\mathbb{R}$. Show:

$$\int_{\mathbb{R}} \pi_n^2(t; d\lambda)\, d\lambda(t) \le \int_{\mathbb{R}} p^2(t)\, d\lambda(t), \quad \text{all } p \in \mathbb{P}_n^{\circ},$$

where $\mathbb{P}_n^{\circ}$ is the class of monic polynomials of degree $n$. Discuss the case
of equality. {Hint: represent $p$ in terms of $\pi_j(\cdot\,; d\lambda)$, $j = 0, 1, \ldots, n$.}

(b) If $d\lambda(t) = d\lambda_N(t)$ is a discrete measure with exactly $N$ support points
$t_1, t_2, \ldots, t_N$, and $\pi_j(\cdot) = \pi_j(\cdot\,; d\lambda_N)$, $j = 0, 1, \ldots, N-1$, are the corresponding
(monic) orthogonal polynomials, let $\pi_N(t) = (t - \alpha_{N-1})\pi_{N-1}(t)
- \beta_{N-1}\pi_{N-2}(t)$, with $\alpha_{N-1}$, $\beta_{N-1}$ defined as in Sect. 2.1.4(2). Show that
$\pi_N(t_j) = 0$ for $j = 1, 2, \ldots, N$.

23. Let $\{\pi_j\}_{j=0}^{n}$ be a system of orthogonal polynomials, not necessarily monic,
relative to the (positive) measure $d\lambda$. For some $a_{ij}$, define

$$p_i(t) = \sum_{j=0}^{n} a_{ij}\,\pi_j(t), \quad i = 0, 1, 2, \ldots, n.$$

(a) Derive conditions on the matrix $A = [a_{ij}]$ which ensure that the system
$\{p_i\}_{i=0}^{n}$ is also a system of orthogonal polynomials.

(b) Assuming all $\pi_j$ monic and $\{p_i\}_{i=0}^{n}$ an orthogonal system, show that each
$p_i$ is monic if and only if $A = I$ is the identity matrix.

(c) Prove the same as in (b), with "monic" replaced by "orthonormal" throughout.


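24. Let $(u,v) = \sum_{k=1}^{N} w_k\, u(t_k)\, v(t_k)$ be a discrete inner product on the interval
$[-1,1]$ with $-1 \le t_1 < t_2 < \cdots < t_N \le 1$, and let $\alpha_k$, $\beta_k$ be the recursion
coefficients for the (monic) orthogonal polynomials $\{\pi_k(t)\}_{k=0}^{N-1}$ associated with
$(u, v)$:

$$\pi_{k+1}(t) = (t - \alpha_k)\pi_k(t) - \beta_k \pi_{k-1}(t), \quad k = 0, 1, 2, \ldots, N-2,$$
$$\pi_0(t) = 1, \quad \pi_{-1}(t) = 0.$$

Let $x = \frac{b-a}{2}\, t + \frac{a+b}{2}$ map the interval $[-1,1]$ to $[a, b]$, and the points
$t_k \in [-1,1]$ to $x_k \in [a,b]$. Define $(u,v)^* = \sum_{k=1}^{N} w_k\, u(x_k)\, v(x_k)$, and let
$\{\pi_k^*(x)\}_{k=0}^{N-1}$ be the (monic) orthogonal polynomials associated with $(u, v)^*$.
Express the recursion coefficients $\alpha_k^*$, $\beta_k^*$ for the $\{\pi_k^*\}$ in terms of those for
$\{\pi_k\}$. {Hint: first show that $\pi_k^*(x) = \left(\frac{b-a}{2}\right)^k \pi_k\!\left(\frac{2}{b-a}\left(x - \frac{a+b}{2}\right)\right)$.}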

25. Let

$$\pi_{k+1}(t) = (t - \alpha_k)\pi_k(t) - \beta_k \pi_{k-1}(t), \quad k = 0, 1, 2, \ldots, n-1, \qquad (*)$$
$$\pi_0(t) = 1, \quad \pi_{-1}(t) = 0$$

and consider

$$p_n(t) = \sum_{j=0}^{n} c_j\, \pi_j(t).$$

Show that $p_n$ can be computed by the following algorithm (Clenshaw's
algorithm):

$$u_n = c_n, \quad u_{n+1} = 0,$$
$$u_k = (t - \alpha_k)\,u_{k+1} - \beta_{k+1}\,u_{k+2} + c_k, \quad k = n-1, n-2, \ldots, 0, \qquad (**)$$
$$p_n = u_0.$$

{Hint: write $(*)$ in matrix form in terms of the vector $\pi^T = [\pi_0, \pi_1, \ldots, \pi_n]$
and a unit triangular matrix. Do likewise for $(**)$.}
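
The recurrences $(*)$ and $(**)$ can be compared numerically before the equivalence is proved. The Python sketch below (not the Matlab of the text) evaluates $p_n(t)$ both ways; the names `clenshaw` and `direct_sum` and the list layout are assumptions of this illustration. Here `alpha` holds $\alpha_0, \ldots, \alpha_{n-1}$ and `beta` holds $\beta_0, \ldots, \beta_n$; the entries $\beta_0$ and $\beta_n$ are only ever multiplied by zero.

```python
def clenshaw(c, alpha, beta, t):
    """Evaluate sum_j c[j] * pi_j(t) by the backward recurrence (**)."""
    n = len(c) - 1
    u_next, u_next2 = c[n], 0.0        # u_n and u_{n+1}
    for k in range(n - 1, -1, -1):
        u_k = (t - alpha[k]) * u_next - beta[k + 1] * u_next2 + c[k]
        u_next, u_next2 = u_k, u_next
    return u_next                      # this is u_0 = p_n(t)


def direct_sum(c, alpha, beta, t):
    """Evaluate the same sum by generating pi_0, pi_1, ... from (*)."""
    n = len(c) - 1
    p_prev, p = 0.0, 1.0               # pi_{-1} and pi_0
    total = c[0] * p
    for k in range(n):
        p_prev, p = p, (t - alpha[k]) * p - beta[k] * p_prev
        total += c[k + 1] * p
    return total
```

For example, with the monic Chebyshev coefficients `alpha = [0, 0, 0]`, `beta = [3.14, 0.5, 0.25, 0.25]` (the first entry is arbitrary) and `c = [1, 2, 3, 4]`, both functions return the same value at `t = 0.3` up to rounding.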

26. Show that the elementary Lagrange interpolation polynomials $\ell_i(x)$ are invariant
with respect to any linear transformation of the independent variable.

27. Use Matlab to prepare plots of the Lebesgue function for interpolation,
$\lambda_n(x)$, $-1 \le x \le 1$, for $n = 5, 10, 20$, with the interpolation nodes $x_i$ being
given by

(a) $x_i = -1 + \frac{2i}{n}$, $i = 0, 1, 2, \ldots, n$;

(b) $x_i = \cos\left(\frac{2i+1}{2n+2}\,\pi\right)$, $i = 0, 1, 2, \ldots, n$.

Compute $\lambda_n(x)$ on a grid obtained by dividing each interval $[x_{i-1}, x_i]$, $i =
1, 2, \ldots, n$, into 20 equal subintervals. Plot $\log_{10} \lambda_n(x)$ in case (a), and $\lambda_n(x)$
in case (b). Comment on the results.
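
As a starting point for the computation described above, the Lebesgue function $\lambda_n(x) = \sum_i |\ell_i(x)|$ can be evaluated straight from the Lagrange basis. The following is a Python sketch only (the exercise asks for Matlab and for plots, which are omitted here); all function names are invented for the illustration, and the grid uses 20 subintervals between consecutive nodes as in the exercise.

```python
import math


def lebesgue(nodes, x):
    """Lebesgue function: sum over i of |l_i(x)| for the Lagrange basis on nodes."""
    total = 0.0
    for i, xi in enumerate(nodes):
        li = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += abs(li)
    return total


def equispaced(n):
    """Nodes of case (a): x_i = -1 + 2i/n."""
    return [-1.0 + 2.0 * i / n for i in range(n + 1)]


def chebyshev(n):
    """Nodes of case (b): x_i = cos((2i+1) pi / (2n+2))."""
    return [math.cos((2 * i + 1) * math.pi / (2 * n + 2)) for i in range(n + 1)]


def max_on_grid(nodes, per_interval=20):
    """Largest sampled value of the Lebesgue function between consecutive nodes."""
    pts = sorted(nodes)
    best = 0.0
    for a, b in zip(pts, pts[1:]):
        for k in range(per_interval + 1):
            best = max(best, lebesgue(nodes, a + (b - a) * k / per_interval))
    return best
```

Comparing `max_on_grid(equispaced(n))` with `max_on_grid(chebyshev(n))` for growing `n` gives a first numerical impression of why a logarithmic scale is suggested for case (a).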

28. Let $\omega_n(x) = \prod_{k=0}^{n} (x - k)$ and denote by $x_n$ the location of the extremum of
$\omega_n$ on $[0,1]$, that is, the unique $x$ in $[0,1]$ where $\omega_n'(x) = 0$.

(a) Prove or disprove that $x_n \to 0$ as $n \to \infty$.
(b) Investigate the monotonicity of $x_n$ as $n$ increases.

29. Consider equidistant sampling points $x_k = k$ ($k = 0, 1, \ldots, n$) and $\omega_n(x) =
\prod_{k=0}^{n} (x - k)$, $0 \le x \le n$.

(a) Show that $\omega_n(x) = (-1)^{n+1} \omega_n(n - x)$. What kind of symmetry does this
imply?

(b) Show that $|\omega_n(x)| < |\omega_n(x + 1)|$ for nonintegral $x > (n - 1)/2$.

(c) Show that the relative maxima of $|\omega_n(x)|$ increase monotonically (from the
center of $[0,n]$ outward).


30. Let

$$\lambda_n(x) = \sum_{i=0}^{n} |\ell_i(x)|$$

be the Lebesgue function for polynomial interpolation at the distinct points
$x_i \in [a,b]$, $i = 0, 1, \ldots, n$, and $\Lambda_n = \|\lambda_n\|_\infty = \max_{a \le x \le b} |\lambda_n(x)|$ the
Lebesgue constant. Let $p_n(f; \cdot)$ be the polynomial of degree $\le n$ interpolating
$f$ at the nodes $x_i$. Show that in the inequality

$$\|p_n(f; \cdot)\|_\infty \le \Lambda_n \|f\|_\infty, \quad f \in C[a, b],$$

equality can be attained for some $f = \varphi \in C[a,b]$. {Hint: let $\|\lambda_n\|_\infty =
\lambda_n(x_\infty)$; take $\varphi \in C[a, b]$ piecewise linear and such that $\varphi(x_i) = \operatorname{sgn} \ell_i(x_\infty)$,
$i = 0, 1, \ldots, n$.}

31. (a) Let $x_0, x_1, \ldots, x_n$ be $n + 1$ distinct points in $[a,b]$ and $f_i = f(x_i)$, $i =
0, 1, \ldots, n$, for some function $f$. Let $f_i^* = f_i + \varepsilon_i$, where $|\varepsilon_i| \le \varepsilon$. Use
the Lagrange interpolation formula to show that $|p_n(f^*; x) - p_n(f; x)| \le
\varepsilon \lambda_n(x)$, $a \le x \le b$, where $\lambda_n(x)$ is the Lebesgue function (cf. Ex. 30).

(b) Show: $\lambda_n(x_j) = 1$ for $j = 0, 1, \ldots, n$.

(c) For quadratic interpolation at three equally spaced points, show that
$\lambda_2(x) \le 1.25$ for any $x$ between the three points.

(d) Obtain $\lambda_2(x)$ for $x_0 = 0$, $x_1 = 1$, $x_2 = p$, where $p > 1$, and determine
$\max_{1 \le x \le p} \lambda_2(x)$. How fast does this maximum grow with $p$? {Hint: to
simplify the algebra, note from (b) that $\lambda_2(x)$ on $1 \le x \le p$ must be
of the form $\lambda_2(x) = 1 + c(x - 1)(p - x)$ for some constant $c$.}

32. In a table of the Bessel function $J_0(x) = \frac{1}{\pi} \int_0^{\pi} \cos(x \sin\theta)\, d\theta$, where $x$ is
incremented in steps of size $h$, how small must $h$ be chosen if the table is to
be "linearly interpolable" with error less than $10^{-6}$ in absolute value? {Point of
information: $\int_0^{\pi} \sin^2\theta\, d\theta = \frac{\pi}{2}$.}

33. Suppose you have a table of the logarithm function $\ln x$ for positive integer
values of $x$, and you compute $\ln 11.1$ by quadratic interpolation at $x_0 = 10$,
$x_1 = 11$, $x_2 = 12$. Estimate the relative error incurred.

34. The "Airy function" $y(x) = \mathrm{Ai}(x)$ is a solution of the differential equation
$y'' = xy$ satisfying appropriate initial conditions. It is known that $\mathrm{Ai}(x)$ on
$[0, \infty)$ is monotonically decreasing to zero and $\mathrm{Ai}'(x)$ monotonically increasing
to zero. Suppose you have a table of $\mathrm{Ai}$ and $\mathrm{Ai}'$ (with tabular step $h$) and you
want to interpolate

(a) linearly between $x_0$ and $x_1$,
(b) quadratically between $x_0$, $x_1$, and $x_2$,

where $x_0$, $x_1 = x_0 + h$, $x_2 = x_0 + 2h$ are (positive) tabular arguments.
Determine close upper bounds for the respective errors in terms of quantities
$y_k = y(x_k)$, $y_k' = y'(x_k)$, $k = 0, 1, 2$, contained in the table.
35. The error in linear interpolation of $f$ at $x_0$, $x_1$ is known to be

$$f(x) - p_1(f; x) = (x - x_0)(x - x_1)\, \frac{f''(\xi(x))}{2}, \quad x_0 < x < x_1,$$

if $f \in C^2[x_0, x_1]$. Determine $\xi(x)$ explicitly in the case $f(x) = \frac{1}{x}$, $x_0 = 1$,
$x_1 = 2$, and find $\max_{1 \le x \le 2} \xi(x)$ and $\min_{1 \le x \le 2} \xi(x)$.

36. (a) Let $p_n(f; x)$ be the interpolation polynomial of degree $\le n$ interpolating
$f(x) = e^x$ at the points $x_i = i/n$, $i = 0, 1, 2, \ldots, n$. Derive an upper
bound for

$$\max_{0 \le x \le 1} |e^x - p_n(f; x)|$$

and determine the smallest $n$ guaranteeing an error less than $10^{-6}$ on
$[0, 1]$. {Hint: first show that for any integer $i$ with $0 \le i \le n$ one has
$\max_{0 \le x \le 1} \left|\left(x - \frac{i}{n}\right)\left(x - \frac{n-i}{n}\right)\right| \le \frac{1}{4}$.}

(b) Solve the analogous problem for the $n$th-degree Taylor polynomial $t_n(x) =
1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!}$, and compare the result with the one in (a).

37. Let $x_0 < x_1 < x_2 < \cdots < x_n$ and $H = \max_{0 \le i \le n-1}(x_{i+1} - x_i)$. Defining
$\omega_n(x) = \prod_{i=0}^{n} (x - x_i)$, find an upper bound for $\|\omega_n\|_\infty = \max_{x_0 \le x \le x_n} |\omega_n(x)|$
in terms of $H$ and $n$. {Hint: assume $x_j \le x \le x_{j+1}$ for some $0 \le j < n$ and
estimate $(x - x_j)(x - x_{j+1})$ and $\prod_{i \ne j,\, i \ne j+1} (x - x_i)$ separately.}

38. Show that the power $x^n$ on the interval $-1 \le x \le 1$ can be uniformly
approximated by a linear combination of powers $1, x, x^2, \ldots, x^{n-1}$ with error
$\le 2^{-(n-1)}$. In this sense, the powers of $x$ become "less and less linearly
independent" on $[-1,1]$ with growing exponent $n$.

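39. Determine

$$\min\, \max_{a \le x \le b} |a_0 x^n + a_1 x^{n-1} + \cdots + a_n|, \quad n \ge 1,$$

where the minimum is taken over all real $a_0, a_1, \ldots, a_n$ with $a_0 \ne 0$. {Hint: use
Theorem 2.2.1.}

40. Let $a > 1$ and $\mathbb{P}_n^a = \{p \in \mathbb{P}_n : p(a) = 1\}$. Define $\hat p_n \in \mathbb{P}_n^a$ by $\hat p_n(x) = \frac{T_n(x)}{T_n(a)}$,
where $T_n$ is the Chebyshev polynomial of degree $n$, and let $\|\cdot\|_\infty$ denote the
maximum norm on the interval $[-1, 1]$. Prove:

$$\|\hat p_n\|_\infty \le \|p\|_\infty \quad \text{for all } p \in \mathbb{P}_n^a.$$

{Hint: imitate the proof of Theorem 2.2.1.}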
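41. Let

$$f(x) = \int_5^{\infty} \frac{e^{-t}}{t - x}\, dt, \quad -1 \le x \le 1,$$

and let $p_{n-1}(f; \cdot)$ be the polynomial of degree $\le n-1$ interpolating $f$ at the $n$
Chebyshev points $x_\nu = \cos\left(\frac{2\nu - 1}{2n}\,\pi\right)$, $\nu = 1, 2, \ldots, n$. Derive an upper bound
for $\max_{-1 \le x \le 1} |f(x) - p_{n-1}(f; x)|$.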
42. Let $f$ be a positive function defined on $[a, b]$ and assume

$$\min_{a \le x \le b} |f(x)| = m_0, \quad \max_{a \le x \le b} |f^{(k)}(x)| = M_k, \quad k = 0, 1, 2, \ldots.$$

(a) Denote by $p_{n-1}(f; \cdot)$ the polynomial of degree $\le n - 1$ interpolating
$f$ at the $n$ Chebyshev points (relative to the interval $[a, b]$). Estimate the
maximum relative error $r_n = \max_{a \le x \le b} |(f(x) - p_{n-1}(f; x))/f(x)|$.

(b) Apply the result of (a) to $f(x) = \ln x$ on $I_r = \{e^r \le x \le e^{r+1}\}$, $r \ge 1$ an
integer. In particular, show that $r_n \le \alpha(r,n)\, c^n$, where $0 < c < 1$ and $\alpha$ is
slowly varying. Exhibit $c$.

(c) (This relates to the function $f(x) = \ln x$ of part (b).) How does one
compute $f(x)$, $x \in I_s$, from $f(x)$, $x \in I_r$?


43. (a) For quadratic interpolation on equally spaced points $x_0$, $x_1 = x_0 + h$,
$x_2 = x_0 + 2h$, derive an upper bound for $\|f - p_2(f; \cdot)\|_\infty$ involving
$\|f'''\|_\infty$ and $h$. (Here $\|u\|_\infty = \max_{x_0 \le x \le x_2} |u(x)|$.)

(b) Compare the bound obtained in (a) with the analogous one for interpolation
at the three Chebyshev points on $[x_0, x_2]$.

44. (a) Suppose the function $f(x) = \ln(2 + x)$, $-1 \le x \le 1$, is interpolated by a
polynomial $p_n$ of degree $\le n$ at the Chebyshev points $x_k = \cos\left(\frac{2k+1}{2n+2}\,\pi\right)$,
$k = 0, 1, \ldots, n$. Derive a bound for the maximum error $\|f - p_n\|_\infty =
\max_{-1 \le x \le 1} |f(x) - p_n(x)|$.

(b) Compare the result of (a) with bounds for $\|f - t_n\|_\infty$, where $t_n(x)$ is the
$n$th-degree Taylor polynomial of $f$ and where either Lagrange's form of
the remainder is used or the full Taylor expansion of $f$.

45. Consider $f(t) = \cos^{-1} t$, $-1 \le t \le 1$. Obtain the least squares approximation
$\hat\varphi_n \in \mathbb{P}_n$ of $f$ relative to the weight function $w(t) = (1 - t^2)^{-1/2}$; that is, find the
solution $\varphi = \hat\varphi_n$ of

$$\text{minimize} \left\{ \int_{-1}^{1} [f(t) - \varphi(t)]^2\, \frac{dt}{\sqrt{1 - t^2}} : \varphi \in \mathbb{P}_n \right\}.$$

Express $\hat\varphi_n$ in terms of Chebyshev polynomials $\pi_j(t) = T_j(t)$.

46. Compute $T_n'(0)$, where $T_n$ is the Chebyshev polynomial of degree $n$.

47. Prove that the system of Chebyshev polynomials $\{T_k : 0 \le k < n\}$ is
orthogonal with respect to the discrete inner product $(u,v) = \sum_{\nu=1}^{n} u(x_\nu)\,v(x_\nu)$,
where $x_\nu$ are the Chebyshev points $x_\nu = \cos\left(\frac{2\nu - 1}{2n}\,\pi\right)$.

48. Let $T_k(x)$ denote the Chebyshev polynomial of degree $k$. Clearly, $T_n(T_m(x))$ is
a polynomial of degree $n \cdot m$. Identify it.

49. Let $T_n$ denote the Chebyshev polynomial of degree $n \ge 2$. The equation

$$x = T_n(x)$$

is an algebraic equation of degree $n$ and hence has exactly $n$ roots. Identify
them.

50. For any $x$ with $0 \le x \le 1$ show that $T_n(2x - 1) = T_{2n}(\sqrt{x})$.
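
Several of the Chebyshev exercises above can be spot-checked numerically before a proof is attempted. The Python sketch below (the text uses Matlab) evaluates $T_n$ by the standard recurrence $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$, which is assumed here rather than quoted from the exercises; the names `chebyshev_T` and `discrete_inner` are illustrative.

```python
import math


def chebyshev_T(n, x):
    """T_n(x) from T_0 = 1, T_1 = x, T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t


def discrete_inner(j, k, n):
    """Sum of T_j(x_v) * T_k(x_v) over the n Chebyshev points x_v = cos((2v-1) pi / (2n))."""
    total = 0.0
    for v in range(1, n + 1):
        x = math.cos((2 * v - 1) * math.pi / (2 * n))
        total += chebyshev_T(j, x) * chebyshev_T(k, x)
    return total
```

For instance, `chebyshev_T(3, 2*x - 1)` and `chebyshev_T(6, math.sqrt(x))` agree to rounding for `x` in $[0,1]$, and `discrete_inner(j, k, n)` is negligibly small for distinct `j`, `k` below `n`.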

51. Let $f(x)$ be defined for all $x \in \mathbb{R}$ and infinitely often differentiable on $\mathbb{R}$.
Assume further that

$$|f^{(m)}(x)| \le 1, \quad \text{all } x \in \mathbb{R}, \quad m = 1, 2, 3, \ldots.$$

Let $h > 0$ and $p_{2n-1}$ be the polynomial of degree $< 2n$ interpolating $f$ at the
$2n$ points $x = kh$, $k = \pm 1, \pm 2, \ldots, \pm n$. For what values of $h$ is it true that

$$\lim_{n \to \infty} p_{2n-1}(0) = f(0)\,?$$

(Note that $x = 0$ is not an interpolation node.) Explain why the convergence
theory discussed in Sect. 2.2.3 does not apply here. {Point of information: $n! \sim
\sqrt{2\pi n}\,(n/e)^n$ as $n \to \infty$ (Stirling's formula).}

52. (a) Let $x_i^C = \cos\left(\frac{2i+1}{2n+2}\,\pi\right)$, $i = 0, 1, \ldots, n$, be Chebyshev points on $[-1, 1]$.
Obtain the analogous Chebyshev points $t_i^C$ on $[a,b]$ (where $a < b$) and
find an upper bound of $\prod_{i=0}^{n} (t - t_i^C)$ for $a \le t \le b$.

(b) Consider $f(t) = \ln t$ on $[a,b]$, $0 < a < b$, and let $p_n(t) = p_n(f; t_0^{(n)},
t_1^{(n)}, \ldots, t_n^{(n)}; t)$. Given $a > 0$, how large can $b$ be chosen such that
$\lim_{n \to \infty} p_n(t) = f(t)$ for arbitrary nodes $t_i^{(n)} \in [a,b]$ and arbitrary $t \in
[a,b]$?

(c) Repeat (b), but with $t_i^{(n)} = t_i^C$ (see (a)).

53. Let $\mathbb{P}_m^+$ be the set of all polynomials of degree $\le m$ that are nonnegative on the
real line,

$$\mathbb{P}_m^+ = \{p : p \in \mathbb{P}_m,\ p(x) \ge 0 \text{ for all } x \in \mathbb{R}\}.$$

Consider the following interpolation problem: find $p \in \mathbb{P}_m^+$ such that $p(x_i) =
f_i$, $i = 0, 1, \ldots, n$, where $f_i \ge 0$ and $x_i$ are distinct points on $\mathbb{R}$.

(a) Show that, if $m = 2n$, the problem admits a solution for arbitrary $f_i \ge 0$.
(b) Prove: if a solution is to exist for arbitrary $f_i \ge 0$, then, necessarily,
$m \ge 2n$. {Hint: consider $f_0 = 1$, $f_1 = f_2 = \cdots = f_n = 0$.}


54. Defining forward differences by $\Delta f(x) = f(x + h) - f(x)$, $\Delta^2 f(x) =
\Delta(\Delta f(x)) = f(x + 2h) - 2f(x + h) + f(x)$, and so on, show that

$$\Delta^k f(x) = k!\, h^k\, [x_0, x_1, \ldots, x_k] f,$$

where $x_j = x + jh$, $j = 0, 1, 2, \ldots$. Prove an analogous formula for backward
differences.
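
The relation between forward differences and divided differences is easy to test on a small example. The Python sketch below uses exact rational arithmetic; the helper names are invented for the illustration, and the recursive definition of the divided difference is the standard one for distinct nodes (assumed here, not quoted from the exercise).

```python
from fractions import Fraction
from math import factorial


def forward_difference(f, x, h, k):
    """k-th forward difference of f at x with step h, built from a difference table."""
    vals = [f(x + j * h) for j in range(k + 1)]
    for _ in range(k):
        vals = [vals[j + 1] - vals[j] for j in range(len(vals) - 1)]
    return vals[0]


def divided_difference(xs, f):
    """[x_0, ..., x_k] f for distinct nodes, by the recursive definition."""
    if len(xs) == 1:
        return f(xs[0])
    upper = divided_difference(xs[1:], f)
    lower = divided_difference(xs[:-1], f)
    return (upper - lower) / (xs[-1] - xs[0])
```

With `f = lambda x: x**3`, `x = Fraction(1)`, `h = Fraction(1, 2)` and `k = 3`, the forward difference equals `factorial(3) * h**3` times the divided difference on the nodes `x + j*h`.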

55. Let $f(x) = x^7$. Compute the fifth divided difference $[0,1,1,1,2,2]f$ of $f$. It is
known that this divided difference is expressible in terms of the fifth derivative
of $f$ evaluated at some $\xi$, $0 < \xi < 2$ (cf. (2.117)). Determine $\xi$.

56. In this problem $f(x) = e^x$ throughout.

(a) Prove: for any real number $t$, one has

$$[t, t+1, \ldots, t+n] f = \frac{(e - 1)^n}{n!}\, e^t.$$

{Hint: use induction on $n$.}
(b) From (2.117) we know that

$$[0, 1, \ldots, n] f = \frac{f^{(n)}(\xi)}{n!}, \quad 0 < \xi < n.$$

Use the result in (a) to determine $\xi$. Is $\xi$ located to the left or to the right of
the midpoint $n/2$?

57. (Euler, 1734) Let $x_k = 10^k$, $k = 0, 1, 2, 3, \ldots$, and $f(x) = \log_{10} x$.
(a) Show that

$$[x_0, x_1, \ldots, x_n] f = \frac{(-1)^{n-1}}{10^{n(n-1)/2}\,(10^n - 1)}.$$

{Hint: prove more generally

$$[x_r, x_{r+1}, \ldots, x_{r+n}] f = \frac{(-1)^{n-1}}{10^{rn + n(n-1)/2}\,(10^n - 1)}$$

by induction on $n$.}
(b) Use Newton's interpolation formula to determine $p_n(x) = p_n(f; x_0, x_1,
\ldots, x_n; x)$. Show that $\lim_{n \to \infty} p_n(x)$ exists for $1 \le x < 10$. Is the limit
equal to $\log_{10} x$? (Check, e.g., for $x = 9$.)


58. Show that

$$\frac{\partial}{\partial x_0}\, [x_0, x_1, \ldots, x_n] f = [x_0, x_0, x_1, \ldots, x_n] f,$$

assuming $f$ is differentiable at $x_0$. What about the partial derivative with respect
to one of the other variables?
59. (a) For $n + 1$ distinct nodes $x_\nu$, show that

$$[x_0, x_1, \ldots, x_n] f = \sum_{\nu=0}^{n} \frac{f(x_\nu)}{\prod_{\mu \ne \nu} (x_\nu - x_\mu)}.$$

(b) Show that

$$[x_0, x_1, \ldots, x_n](f g_j) = [x_0, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n] f,$$

where $g_j(x) = x - x_j$.
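60. (Mikeladze, 1941) Assuming $x_0, x_1, \ldots, x_n$ mutually distinct, show that

$$[\underbrace{x_0, \ldots, x_0}_{m \text{ times}}, x_1, x_2, \ldots, x_n] f
= \frac{[\overbrace{x_0, \ldots, x_0}^{m \text{ times}}] f}{\prod_{\mu=1}^{n} (x_0 - x_\mu)}
+ \sum_{\nu=1}^{n} \frac{[\overbrace{x_0, \ldots, x_0}^{(m-1) \text{ times}}, x_\nu] f}{\prod_{\mu=0,\, \mu \ne \nu}^{n} (x_\nu - x_\mu)}.$$

{Hint: use induction on $m$.}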
61. Determine the number of additions and the number of multiplications/divisions
required

(a) to compute all divided differences for $n + 1$ data points,
(b) to compute all auxiliary quantities $\lambda_i^{(n)}$ in (2.103), and
(c) to compute $p_n(f; \cdot)$ (efficiently) from Newton's formula (2.111), once
the divided differences are available. Compare with the analogous count
for the barycentric formula (2.105), assuming all auxiliary quantities
available. Overall, which, if any, of the two formulae can be computed more
economically?


62. Consider the data $f(0) = 5$, $f(1) = 3$, $f(3) = 5$, $f(4) = 12$.

(a) Obtain the appropriate interpolation polynomial $p_3(f; x)$ in Newton's
form.

(b) The data suggest that $f$ has a minimum between $x = 1$ and $x = 3$. Find
an approximate value for the location $x_{\min}$ of the minimum.

63. Let $f(x) = (1 + a)^x$, $|a| < 1$. Show that $p_n(f; 0, 1, \ldots, n; x)$ is the truncation
of the binomial series for $f$ to $n + 1$ terms. {Hint: use Newton's form of the
interpolation polynomial.}

64. Suppose $f$ is a function on $[0,3]$ for which one knows that

$$f(0) = 1, \quad f(1) = 2, \quad f'(1) = -1, \quad f(3) = f'(3) = 0.$$

(a) Estimate $f(2)$, using Hermite interpolation.

(b) Estimate the maximum possible error of the answer given in (a) if one
knows, in addition, that $f \in C^5[0, 3]$ and $|f^{(5)}(x)| \le M$ on $[0,3]$. Express
the answer in terms of $M$.


65. (a) Use Hermite interpolation to find a polynomial of lowest degree satisfying
$p(-1) = p'(-1) = 0$, $p(0) = 1$, $p(1) = p'(1) = 0$. Simplify your
expression for $p$ as much as possible.

(b) Suppose the polynomial $p$ of (a) is used to approximate the function
$f(x) = [\cos(\pi x/2)]^2$ on $-1 \le x \le 1$.

(b1) Express the error $e(x) = f(x) - p(x)$ (for some fixed $x$ in $[-1, 1]$)
in terms of an appropriate derivative of $f$.

(b2) Find an upper bound for $|e(x)|$ (still for a fixed $x \in [-1, 1]$).

(b3) Estimate $\max_{-1 \le x \le 1} |e(x)|$.


66. Consider the problem of finding a polynomial $p \in \mathbb{P}_n$ such that

$$p(x_0) = f_0, \quad p'(x_i) = f_i', \quad i = 1, 2, \ldots, n,$$

where $x_i$, $i = 1, 2, \ldots, n$, are distinct nodes. (It is not excluded that $x_1 = x_0$.)
This is neither a Lagrange nor a Hermite interpolation problem (why not?).
Nevertheless, show that the problem has a unique solution and describe how it
can be obtained.

67. Let


= 
II 
o 
=e 
oOo 
A 
IA 
— NI

(a) Find the linear least squares approximant $\hat p_1$ to $f$ on $[0,1]$, that is, the
polynomial $\hat p_1 \in \mathbb{P}_1$ for which

$$\int_0^1 [p(t) - f(t)]^2\, dt = \min.$$

Use the normal equations with $\pi_0(t) = 1$, $\pi_1(t) = t$.

(b) Can you do better with continuous piecewise linear functions (relative to
the partition $[0,1] = [0, \frac{1}{2}] \cup [\frac{1}{2}, 1]$)? Use the normal equations for the
B-spline basis $B_0$, $B_1$, $B_2$ (cf. Sect. 2.2.2 and Ex. 13).


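68. Show that $S_m^m(\Delta) = \mathbb{P}_m$.

69. Let $\Delta$ be the subdivision

$$\Delta = [0,1] \cup [1,2] \cup [2,3]$$

of the interval $[0,3]$. Define the function $s$ by

$$s(x) = \begin{cases} 2 - x(3 - 3x + x^2) & \text{if } 0 \le x \le 1, \\ 1 & \text{if } 1 < x \le 2, \\ \frac{1}{4}\, x^2 (3 - x) & \text{if } 2 < x \le 3. \end{cases}$$

To which class $S_m^k(\Delta)$ does $s$ belong?

70. In

$$s(x) = \begin{cases} p(x) & \text{if } 0 \le x \le 1, \\ (2 - x)^3 & \text{if } 1 \le x \le 2 \end{cases}$$

determine $p \in \mathbb{P}_3$ such that $s(0) = 0$ and $s$ is a cubic spline in $S_3^2(\Delta)$ on the
subdivision $\Delta = [0, 1] \cup [1, 2]$ of the interval $[0,2]$. Do you get a natural spline?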

71. Let $\Delta$: $a = x_1 < x_2 < x_3 < \cdots < x_n = b$ be a subdivision of $[a,b]$ into
$n - 1$ subintervals. What is the dimension of the space $S_m^k = \{s \in C^k[a,b]:
s|_{[x_i, x_{i+1}]} \in \mathbb{P}_m,\ i = 1, 2, \ldots, n-1\}$?

72. Given the subdivision $\Delta$: $a = x_1 < x_2 < \cdots < x_n = b$ of $[a,b]$, determine a
basis of "hat functions" for the space $S = \{s \in S_1^0: s(a) = s(b) = 0\}$.

73. Let $\Delta$: $a = x_1 < x_2 < x_3 < \cdots < x_{n-1} < x_n = b$ be a subdivision of
$[a, b]$ into $n - 1$ subintervals. Suppose we are given values $f_i = f(x_i)$ of some
function $f(x)$ at the points $x = x_i$, $i = 1, 2, \ldots, n$. In this problem $s \in S_2^1$
is a quadratic spline in $C^1[a, b]$ that interpolates $f$ on $\Delta$, that is, $s(x_i) = f_i$,
$i = 1, 2, \ldots, n$.

(a) Explain why one expects an additional condition to be required in order to
determine $s$ uniquely.

(b) Define $m_i = s'(x_i)$, $i = 1, 2, \ldots, n - 1$. Determine $p_i := s|_{[x_i, x_{i+1}]}$,
$i = 1, 2, \ldots, n-1$, in terms of $f_i$, $f_{i+1}$, and $m_i$.

(c) Suppose one takes $m_1 = f'(a)$. (According to (a), this determines $s$
uniquely.) Show how $m_2, m_3, \ldots, m_{n-1}$ can be computed.


74. Let the subdivision $\Delta$ of $[a, b]$ be given by

$$\Delta:\ a = x_1 < x_2 < x_3 < \cdots < x_{n-1} < x_n = b, \quad n \ge 2,$$

and let $f_i = f(x_i)$, $i = 1, 2, \ldots, n$, for some function $f$. Suppose you
want to interpolate this data by a quintic spline $s_5(f; \cdot)$ (a piecewise fifth-degree
polynomial of smoothness class $C^4[a, b]$). By counting the number of
parameters at your disposal and the number of conditions imposed, state how
many additional conditions (if any) you expect are needed to make $s_5(f; \cdot)$
unique.

75. Let

$$\Delta:\ a = x_1 < x_2 < x_3 < \cdots < x_{n-1} < x_n = b.$$

Consider the following problem: given $n - 1$ numbers $f_\nu$ and $n - 1$ points $\xi_\nu$
with $x_\nu < \xi_\nu < x_{\nu+1}$ ($\nu = 1, 2, \ldots, n - 1$), find a piecewise linear function
$s \in S_1^0(\Delta)$ such that

$$s(\xi_\nu) = f_\nu \quad (\nu = 1, 2, \ldots, n-1), \qquad s(x_1) = s(x_n).$$

Representing $s$ in terms of the basis $B_1, B_2, \ldots, B_n$ of "hat functions,"
determine the structure of the linear system of equations that you obtain for
the coefficients $c_j$ in $s(x) = \sum_{j=1}^{n} c_j B_j(x)$. Describe how you would solve
the system.

76. Let $s_1(x) = 1 + c(x + 1)^3$, $-1 \le x \le 0$, where $c$ is a (real) parameter.
Determine $s_2(x)$ on $0 \le x \le 1$ so that

$$s(x) := \begin{cases} s_1(x) & \text{if } -1 \le x \le 0, \\ s_2(x) & \text{if } 0 \le x \le 1 \end{cases}$$

is a natural cubic spline on $[-1, 1]$ with knots at $-1$, $0$, $1$. How must $c$ be chosen
if one wants $s(1) = -1$?

77. Derive (2.136).

78. Determine the quantities $m_i$ in the variant of piecewise cubic Hermite interpolation
mentioned at the end of Sect. 2.3.4(a).

79. (a) Derive the two extra equations for $m_1, m_2, \ldots, m_n$ that result from the
"not-a-knot" condition (Sect. 2.3.4, (b.4)) imposed on the cubic spline
interpolant $s \in S_3^2(\Delta)$ (with $\Delta$ as in Ex. 73).

(b) Adjoin the first of these equations to the top and the second to the bottom
of the system of $n - 2$ equations derived in Sect. 2.3.4(b). Then apply
elementary row operations to produce a tridiagonal system. Display the
new matrix elements in the first and last equations, simplified as much as
possible.

(c) Is the tridiagonal system so obtained diagonally dominant?

80. Let $S_1^0(\Delta)$ be the class of continuous piecewise linear functions relative to the
subdivision $a = x_1 < x_2 < \cdots < x_n = b$. Let $\|g\|_\infty = \max_{a \le x \le b} |g(x)|$, and
denote by $s_1(g; \cdot)$ the piecewise linear interpolant (from $S_1^0(\Delta)$) to $g$.

(a) Show: $\|s_1(g; \cdot)\|_\infty \le \|g\|_\infty$ for any $g \in C[a, b]$.

(b) Show: $\|f - s_1(f; \cdot)\|_\infty \le 2\|f - s\|_\infty$ for any $s \in S_1^0$, $f \in C[a, b]$. {Hint:
use additivity of $s_1(f; \cdot)$ with respect to $f$.}

(c) Interpret the result in (b) when $s$ is the best uniform spline approximant
to $f$.

81. Consider the interval $[a, b] = [-1, 1]$ and its subdivision $\Delta = [-1, 0] \cup [0, 1]$,
and let $f(x) = \cos\frac{\pi}{2}x$, $-1 \le x \le 1$.

(a) Determine the natural cubic spline interpolant to $f$ on $\Delta$.

(b) Illustrate Theorem 2.3.2 by taking in turn $g(x) = p_2(f; -1, 0, 1; x)$ and
$g(x) = f(x)$.

(c) Discuss analogously the complete cubic spline interpolant to $f$ on $\Delta'$
(cf. (2.149)) and the choices $g(x) = p_3(f; -1, 0, 1, 1; x)$ and $g(x) =
f(x)$.

1. (a) A simple-minded approach to best uniform approximation of a function
$f(x)$ on $[0,1]$ by a linear function $ax + b$ is to first discretize the problem
and then, for various (appropriate) trial values of $a$, solve the problem of
(discrete) uniform approximation of $f(x) - ax$ by a constant $b$ (which
admits an easy solution). Write a program to implement this idea.

(b) Run your program for $f(x) = e^x$, $f(x) = 1/(1 + x)$, $f(x) =
\sin\frac{\pi}{2}x$, $f(x) = x^\alpha$ ($\alpha = 2, 3, 4, 5$). Print the respective optimal values of
$a$ and $b$ and the associated minimum error. What do you find particularly
interesting in the results (if anything)?

(c) Give a heuristic explanation (and hence exact values) for the results, using
the known fact that the error curve for the optimal linear approximation
attains its maximum modulus at three consecutive points $0 \le x_0 < x_1 <
x_2 \le 1$ with alternating signs (Principle of Alternation).

2. (a) Determine the $(n + 1) \times (n + 1)$ matrix $A = [a_{ij}]$, $a_{ij} = (B_i^n, B_j^n)$, of the
normal equations relative to the Bernstein basis

$$B_j^n(t) = \binom{n}{j} t^j (1 - t)^{n-j}, \quad j = 0, 1, \ldots, n,$$

and weight function $w(t) \equiv 1$ on $[0,1]$. {Point of information: $\int_0^1 t^k
(1 - t)^\ell\, dt = k!\,\ell!/(k + \ell + 1)!$.}

(b) Use Matlab to solve the normal equations of (a) for n = 5 : 5 : 25, when
the function to be approximated is $f(t) \equiv 1$. What should the exact answer
be? For each $n$, print the infinity norm of the error vector and an estimate
of the condition number of $A$. Comment on your results.

3. Compute discrete least squares approximations to the function $f(t) = \sin\left(\frac{\pi}{2}t\right)$
on $0 \le t \le 1$ by polynomials of the form

$$\varphi_n(t) = t + t(1 - t) \sum_{j=1}^{n} c_j t^{j-1}, \quad n = 1(1)5,$$

using $N$ abscissae $t_k = k/(N + 1)$, $k = 1, 2, \ldots, N$, and equal weights 1.
Note that $\varphi_n(0) = 0$, $\varphi_n(1) = 1$ are the exact values of $f$ at $t = 0$ and $t = 1$,
respectively. {Hint: approximate $f(t) - t$ by a linear combination of $\pi_j(t) =
t^j(1 - t)$, $j = 1, 2, \ldots, n$.} Write a Matlab program for solving the normal
equations $Ac = b$, $A = [(\pi_i, \pi_j)]$, $b = [(\pi_i, f - t)]$, $c = [c_j]$, that does
the computation in both single and double precision. For each $n = 1, 2, \ldots, 5$
output the following:

• the condition number of the system (computed in double precision);
• the maximum relative error in the coefficients, $\max_{1 \le j \le n} |(c_j^s - c_j^d)/c_j^d|$,
where $c_j^s$ are the single-precision values of $c_j$ and $c_j^d$ the double-precision
values;
• the minimum and maximum error (computed in double precision),

$$e_{\min} = \min_{1 \le k \le N} |\varphi_n(t_k) - f(t_k)|, \quad e_{\max} = \max_{1 \le k \le N} |\varphi_n(t_k) - f(t_k)|.$$

Make two runs:
(a) N = 5, 10, 20; (b) N = 4.

Comment on the results.
4. Write a program for discrete polynomial least squares approximation of a
function $f$ defined on $[-1,1]$, using the inner product

$$(u,v) = \frac{2}{N+1} \sum_{i=0}^{N} u(t_i)\,v(t_i), \quad t_i = -1 + \frac{2i}{N}.$$

Follow these steps.
(a) The recurrence coefficients for the appropriate (monic) orthogonal polynomials
$\{\pi_k(t)\}$ are known explicitly:

$$\alpha_k = 0, \quad k = 0, 1, \ldots, N; \qquad \beta_0 = 2,$$
$$\beta_k = \left(1 + \frac{1}{N}\right)^2 \left(1 - \left(\frac{k}{N+1}\right)^2\right) \left(4 - \frac{1}{k^2}\right)^{-1}, \quad k = 1, 2, \ldots, N.$$

(You do not have to prove this.) Define $\gamma_k = \|\pi_k\|^2 = (\pi_k, \pi_k)$, which is
known to be equal to $\beta_0 \beta_1 \cdots \beta_k$ (cf. Ex. 20).

(b) Using the recurrence formula with coefficients $\alpha_k$, $\beta_k$ given in (a), generate
an array $\pi$ of dimension $(N + 2, N + 1)$ containing $\pi_k(t_\ell)$, $k =
0, 1, \ldots, N + 1$; $\ell = 0, 1, \ldots, N$. (Here $k$ is the row index and $\ell$ the column
index.) Define $\mu_k = \max_{0 \le \ell \le N} |\pi_k(t_\ell)|$, $k = 1, 2, \ldots, N + 1$. Print $\beta_k$, $\gamma_k$,
and $\mu_{k+1}$ for $k = 0, 1, 2, \ldots, N$, where $N = 10$. Comment on the results.

(c) With $\hat p_n(t) = \sum_{k=0}^{n} \hat c_k \pi_k(t)$, $n = 0, 1, \ldots, N$, denoting the least squares
approximation of degree $\le n$ to the function $f$ on $[-1,1]$, define

$$\|e_n\|_2 = \|\hat p_n - f\| = (\hat p_n - f, \hat p_n - f)^{1/2},$$
$$\|e_n\|_\infty = \max_{0 \le i \le N} |\hat p_n(t_i) - f(t_i)|.$$

Using the array $\pi$ generated in part (b), compute $\hat c_n$, $\|e_n\|_2$, $\|e_n\|_\infty$, $n =
0, 1, \ldots, N$, for the following four functions:

$$f(t) = e^{-t}, \quad f(t) = \ln(2 + t), \quad f(t) = \sqrt{1 + t}, \quad f(t) = |t|.$$

Be sure you compute $\|e_n\|_2$ as accurately as possible. For $N = 10$ and
for each $f$, print $\hat c_n$, $\|e_n\|_2$, and $\|e_n\|_\infty$ for $n = 0, 1, 2, \ldots, N$. Comment
on your results. In particular, from the information provided in the output,
discuss to what extent the computed coefficients $\hat c_k$ may be corrupted by
rounding errors.

5. (a) A Sobolev-type least squares approximation problem results if the inner
product is defined by

$$(u,v) = \int_{\mathbb{R}} u(t)\,v(t)\,d\lambda_0(t) + \int_{\mathbb{R}} u'(t)\,v'(t)\,d\lambda_1(t),$$

where $d\lambda_0$, $d\lambda_1$ are positive measures. What does this type of approximation
try to accomplish?

(b) Letting $d\lambda_0(t) = dt$, $d\lambda_1(t) = \lambda\,dt$ on $[0,2]$, where $\lambda > 0$ is a parameter,
set up the normal equations for the Sobolev-type approximation in (a) of
the function $f(t) = e^{-t^2}$ on $[0,2]$ by means of a polynomial of degree
$n - 1$. Use the basis $\pi_j(t) = t^{j-1}$, $j = 1, 2, \ldots, n$. {Hint: express the
components $b_i$ of the right-hand vector of the normal equations in terms
of the "incomplete gamma function" $\gamma(a, x) = \int_0^x t^{a-1} e^{-t}\,dt$ with $x = 4$,
$a = i/2$.}

(c) Use Matlab to solve the normal equations for n = 2 : 5 and $\lambda = 0, .5, 1, 2$.
Print

$$\|p_n - f\|_\infty \quad \text{and} \quad \|p_n' - f'\|_\infty, \quad n = 2, 3, 4, 5$$

(or a suitable approximation thereof) along with the condition numbers of
the normal equations. {Use the following values for the incomplete gamma
function: $\gamma(\frac{1}{2}, 4) = 1.764162781524843$, $\gamma(1, 4) = 0.9816843611112658$,
$\gamma(\frac{3}{2}, 4) = 0.8454501129849537$, $\gamma(2, 4) = 0.9084218055563291$, $\gamma(\frac{5}{2}, 4) =
1.121650058367554$.} Comment on the results.

6. With $\omega_n(x) = \prod_{k=0}^{n} (x - k)$, let $M_n$ be the largest, and $m_n$ the smallest, relative
maximum of $|\omega_n(x)|$. For n = 5 : 5 : 30 calculate $M_n$, $m_n$, and $M_n/m_n$, using
Newton's method (cf. Chap. 4, Sect. 4.6), and print also the respective number
of iterations.
7. (a) Write a subroutine that produces the value of the interpolation polynomial
$p_n(f; x_0, x_1, \ldots, x_n; t)$ at any real $t$, where $n \ge 0$ is a given integer,
$x_i$ are $n + 1$ distinct nodes, and $f$ is any function available in the
form of a function subroutine. Use Newton's interpolation formula and
exercise frugality in the use of memory space when generating the divided
differences. It is possible, indeed, to generate them "in place" in a single
array of dimension $n + 1$ that originally contains the values $f(x_i)$, $i =
0, 1, \ldots, n$. {Hint: generate the divided differences from the bottom up.}

(b) Run your routine on the function $f(t) = \frac{1}{1 + t^2}$, $-5 \le t \le 5$, using
$x_i = -5 + 10\frac{i}{n}$, $i = 0, 1, \ldots, n$, and n = 2 : 2 : 8 (Runge's example).
Plot the polynomials against the exact function.

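8. (a) Write a Matlab function y=tridiag(n,a,b,c,v) for solving a tridiagonal
(nonsymmetric) system

$$\begin{bmatrix}
a_1 & c_1 & & & & 0\\
b_1 & a_2 & c_2 & & & \\
 & b_2 & a_3 & c_3 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & \ddots & \ddots & c_{n-1}\\
0 & & & & b_{n-1} & a_n
\end{bmatrix}
\begin{bmatrix} y_1\\ y_2\\ y_3\\ \vdots\\ y_{n-1}\\ y_n \end{bmatrix}
=
\begin{bmatrix} v_1\\ v_2\\ v_3\\ \vdots\\ v_{n-1}\\ v_n \end{bmatrix}$$

by Gauss elimination without pivoting. Keep the program short.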


(b) Write a program for computing the natural spline interpolant $s_{\mathrm{nat}}(f; \cdot)$ on
an arbitrary partition $a = x_1 < x_2 < x_3 < \cdots < x_{n-1} < x_n = b$ of $[a,b]$.
Print $\{i, \mathrm{errmax}(i);\ i = 1,2,\ldots,n-1\}$, where

$$\mathrm{errmax}(i) = \max_{1\le j\le N} \left|s_{\mathrm{nat}}(f; x_{i,j}) - f(x_{i,j})\right|, \quad x_{i,j} = x_i + \frac{j-1}{N-1}\,\Delta x_i.$$

(You will need the function tridiag.) Test the program for cases in which
the error is zero (what are these, and why?).

(c) Write a second program for computing the complete cubic spline interpolant
$s_{\mathrm{compl}}(f; \cdot)$ by modifying the program in (b) with a minimum of
changes. Highlight the changes in the program listing. Apply (and justify)
a test similar to that of (b).
(d) Run the programs in (b) and (c) for $[a, b] = [0, 1]$, $n = 11$, $N = 51$, and
(i) $x_i = \frac{i-1}{n-1}$, $i = 1,2,\ldots,n$; $f(x) = e^{-x}$,
(ii) $x_i = \frac{i-1}{n-1}$, $i = 1,2,\ldots,n$; $f(x) = x^{5/2}$,
(iii) $x_i = \left(\frac{i-1}{n-1}\right)^2$, $i = 1,2,\ldots,n$; $f(x) = x^{5/2}$.

Comment on the results.
11. (a) We have

$$(\pi_r, \pi_s) = \int_0^\infty e^{-rt}e^{-st}\,dt = -\frac{1}{r+s}\,e^{-(r+s)t}\Big|_0^\infty = \frac{1}{r+s},$$

$$(\pi_r, f) = \int_0^1 e^{-rt}\,dt = -\frac{1}{r}\,e^{-rt}\Big|_0^1 = \frac{1}{r}\left(1 - e^{-r}\right).$$

The normal equations, therefore, are

$$\sum_{s=1}^{n} \frac{1}{r+s}\,c_s = \frac{1}{r}\left(1 - e^{-r}\right), \quad r = 1,2,\ldots,n.$$

The matrix is the Hilbert matrix of order n + 1 with the first column and
last row removed.


(b) PROGRAM

%EXII_11B
%
f0='%8.0f %12.4e\n';
f1='%45.14e\n';
disp('       n     cond          solution')
for n=1:8
  A=hilb(n+1);
  A(:,1)=[];
  A(n+1,:)=[];
  x=(1:n)';
  b=(1-exp(-x))./x;
  c=A\b;
  cd=cond(A);
  fprintf(f0,n,cd)
  fprintf(f1,c)
  for i=1:201
    t=.01*(i-1);
    fa(i,n)=sum(c.*exp(-x*t));
  end
end
for i=1:201
  tfa(i)=.01*(i-1);
end
tt; 
on 


OUTPUT 


>> EXII 11B 
n 
1 Ts 


£); 


cond 
0000e+00 


-8474e+01 


-3533e+03 


-5880e+04 


-5350e+06 


.1098e+07 


solution 


-26424111765712e+00 


-00219345775339e+00 
-93071489855589e-01 


-23430987802214e+00 
-33908483295774e+00 
-45501111925180e+00 


-09728726098036e+00 
.58114152051443e+01 
-03996718636248e+01 
-55105210088422e+00 


-95960905289307e-01 
-29075627900844e+01 
-01167511196597e+01 
-26470845210147e+02 
-03098537899591e+01 


-68879580265092e+00 
7 1.6978e+09


8 5.6392e+10 
-47821734938751e+01
-03448008206274e+02 
-28966173654415e+02 
-62805182233655e+02 
-84248287095832e+02 


-19410815562677e+00 
-89096699709436e+01 
-44042318216034e+01 
-67846414188988e+02 
-16935587561045e+02 
-99544328640241e+02 
-66412000082403e+02 


-39677086853911e+00 
-42030165764484e+01 
-09672261269167e+03 
-45217770865201e+03 
-33593305457727e+04 
.71746576145770e+04 
-11498207678274e+04 
-88841304234284e+03 


The condition numbers here are even a bit larger than the condition numbers 
of the Hilbert matrices of the same order (cf. Chap. 1, MA 9). 




dotted line: n=1, dashdotted line: n = 2, dashed line: n=3, 
solid line n=4 
16. (a) Let $p(t) = a_0 + a_1 t + a_2 t^2$. Then $p$ satisfies the constraints if and only if
$a_0 = 0$, $a_0 + a_1 + a_2 = 1$, that is,

$$p(t) = t^2 + a_1 t(1 - t).$$

Therefore, we need to minimize

$$\int_0^1 [f(t) - p(t)]^2\,dt = \int_0^1 [f(t) - t^2 - a_1 t(1-t)]^2\,dt.$$

This is an unconstrained least squares problem for approximating the
function $f(t) - t^2$ by a multiple of $\pi_1(t) = t(1-t)$. The normal equation is

$$a_1 \int_0^1 [t(1-t)]^2\,dt = \int_0^1 [f(t) - t^2]\,t(1-t)\,dt,$$

and yields the solution

$$\hat{p}(t) = t^2 + \hat{a}_1 t(1-t),$$

where

$$\hat{a}_1 = \frac{\int_0^1 [f(t) - t^2]\,t(1-t)\,dt}{\int_0^1 [t(1-t)]^2\,dt} = 30\int_0^1 f(t)\,t(1-t)\,dt - \frac{3}{2}.$$

(b) If $f(t) = t^r$, then

$$\hat{a}_1 = 30\int_0^1 t^{r+1}(1-t)\,dt - \frac{3}{2} = \frac{30}{(r+2)(r+3)} - \frac{3}{2}$$

and

$$\hat{p}(t) = t^2 + \left(\frac{30}{(r+2)(r+3)} - \frac{3}{2}\right)t(1-t)
= \left[\frac{30}{(r+2)(r+3)} - \frac{3}{2}\right]t + \left[\frac{5}{2} - \frac{30}{(r+2)(r+3)}\right]t^2.$$

For $r = 3$, this gives $\hat{p}(t) = .5t(3t - 1)$.
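
The closed form for $\hat{a}_1$ can be checked in exact rational arithmetic. The following is a minimal sketch, not part of the original solution; the function names `a1_hat` and `objective` are chosen here for illustration, and $r$ is assumed to be a nonnegative integer so that all integrals are rational.

```python
from fractions import Fraction


def a1_hat(r):
    """Closed form 30/((r+2)(r+3)) - 3/2 from part (b), for f(t) = t^r."""
    return Fraction(30, (r + 2) * (r + 3)) - Fraction(3, 2)


def objective(a, r):
    """Exact value of the integral over [0,1] of (t^r - t^2 - a*t*(1-t))^2."""
    # Residual polynomial t^r - t^2 - a*t + a*t^2 stored as {power: coefficient}.
    res = {}
    for power, coef in ((r, Fraction(1)), (2, Fraction(-1)), (1, -a), (2, a)):
        res[power] = res.get(power, Fraction(0)) + coef
    total = Fraction(0)
    for p, cp in res.items():
        for q, cq in res.items():
            total += cp * cq / (p + q + 1)  # integral of t^(p+q) over [0,1]
    return total


a = a1_hat(3)                      # equals -1/2, so p(t) = .5 t (3t - 1)
delta = Fraction(1, 100)
print(a, objective(a, 3), objective(a + delta, 3), objective(a - delta, 3))
```

Perturbing $\hat{a}_1$ in either direction increases the objective, as it must at the minimizer of a strictly convex quadratic.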
Plot:
[Figure: graphs of both functions on $0 \le t \le 1$]
solid line: $y = t^3$
dashed line: $y = .5t(3t - 1)$
27. PROGRAM

n=5;
%n=20;
i=1:n+1; mu=1:n+1;
% equally spaced points
x=-1+2*(i-1)/n;
%
% Chebyshev points
%x=cos((2*(i-1)+1)*pi/(2*n+2));
iplot=0;
for k=2:n+1
%for k=1:n+2
  for j=1:21
    iplot=iplot+1;
    t(iplot)=x(k-1)+(j-1)*(x(k)-x(k-1))/20;
%   if k==1
%     t(iplot)=1+(j-1)*(x(1)-1)/20;
%   elseif k<=n+1
%     t(iplot)=x(k-1)+(j-1)*(x(k)-x(k-1))/20;
%   else
%     t(iplot)=x(n+1)+(j-1)*(-1-x(n+1))/20;
%   end
    s=0;
    for nu=1:n+1
      mu0=find(mu-nu);
      p=prod((t(iplot)-x(mu0))./(x(nu)-x(mu0)));
      s=s+abs(p);
    end
    leb(iplot)=s;
  end
end
plot(t,log10(leb))
%plot(t,leb)
axis([-1.2 1.2 -.05 .55])
%axis([-1.2 1.2 -.1 1.6])
%axis([-1.2 1.2 -.25 4.25])
%axis([-1.1 1.1 .9 2.4])
%axis([-1.1 1.1 .9 2.6])
%axis([-1.1 1.1 .9 3])
title('equally spaced points; n=5','Fontsize',14)
%title('equally spaced points; n=10','Fontsize',14)
%title('equally spaced points; n=20','Fontsize',14)
%title('Chebyshev points; n=5','Fontsize',14)
%title('Chebyshev points; n=10','Fontsize',14)
ylabel('log lambda','Fontsize',14)
%ylabel('lambda','Fontsize',14)

OUTPUT
(on the next page)

At the interpolation nodes $x_i$, one clearly has $\lambda_n(x_i) = 1$. The local maxima
of $\lambda_n$ between successive interpolation nodes are almost equal, and relatively
small, in case (b), but become huge near the endpoints of $[-1, 1]$ in case (a). In
case (b), the global maxima occur at the endpoints $\pm 1$.

36. We first prove the assertion of the Hint. One easily verifies that the function
$\left|\left(x - \frac{i}{n}\right)\left(x - \frac{n-i}{n}\right)\right|$ on $[0,1]$ is symmetric with respect to the midpoint $\frac12$.
Being quadratic, its maximum must occur either at $x = 0$ or at $x = \frac12$, and
hence is the larger of $\frac{i(n-i)}{n^2}$ and $\frac{(n-2i)^2}{4n^2}$. The former attains its maximum at
$i = \frac{n}{2}$, the latter at $i = 0$ (and $i = n$). Either one equals $\frac14$. Thus,

$$\max_{0\le x\le 1}\left|\left(x - \frac{i}{n}\right)\left(x - \frac{n-i}{n}\right)\right| \le \frac{1}{4},$$

as claimed.
[Figure: Lebesgue functions on $[-1, 1]$, six panels. Left column (a): "equally spaced points; n=5", "equally spaced points; n=10", "equally spaced points; n=20" (vertical axis: log lambda). Right column (b): "Chebyshev points; n=5", "Chebyshev points; n=10", "Chebyshev points; n=10" (vertical axis: lambda).]
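(a) We have

$$e^x - p_n(f; x) = \frac{e^{\xi(x)}}{(n+1)!}\prod_{k=0}^{n}\left(x - \frac{k}{n}\right), \quad 0 < \xi(x) < 1.$$

Here we use

$$\prod_{k=0}^{n}\left|x - \frac{k}{n}\right| = \left(\prod_{i=0}^{n}\left|x - \frac{i}{n}\right|\left|x - \frac{n-i}{n}\right|\right)^{1/2}$$

along with the assertion of the Hint to obtain

$$\max_{0\le x\le 1}\left|e^x - p_n(f; x)\right| \le \frac{e}{(n+1)!}\left(\frac{1}{4}\right)^{(n+1)/2} = \frac{e}{2^{n+1}(n+1)!}.$$

The smallest $n$ making the upper bound $\le 10^{-6}$ is $n = 7$.
(b) From Taylor's formula,

$$e^x - t_n(x) = \frac{e^{\xi(x)}}{(n+1)!}\,x^{n+1}, \quad 0 < \xi(x) < 1.$$

Thus,

$$\left|e^x - t_n(x)\right| \le \frac{e}{(n+1)!}, \quad 0 \le x \le 1.$$

This bound is larger than the one in (a) by a factor of $2^{n+1}$. Accordingly,
for it to be $\le 10^{-6}$ now requires $n = 9$.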
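
The two values of $n$ quoted above can be confirmed by direct evaluation of the bounds. The short sketch below is an illustration only; the helper name `smallest_n` and the tolerance argument are assumptions introduced here.

```python
import math


def interpolation_bound(n):
    """Bound from (a): e / (2^(n+1) (n+1)!)."""
    return math.e / (2 ** (n + 1) * math.factorial(n + 1))


def taylor_bound(n):
    """Bound from (b): e / (n+1)!."""
    return math.e / math.factorial(n + 1)


def smallest_n(bound, tol=1e-6):
    """Smallest n >= 0 for which bound(n) <= tol."""
    n = 0
    while bound(n) > tol:
        n += 1
    return n


print(smallest_n(interpolation_bound))  # 7
print(smallest_n(taylor_bound))         # 9
```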


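47. We have, for $0 \le k, \ell < n$,

$$(T_k, T_\ell) = \sum_{\nu=1}^{n} T_k(x_\nu)T_\ell(x_\nu) = \sum_{\nu=1}^{n} \cos\left(k\,\frac{2\nu-1}{2n}\,\pi\right)\cos\left(\ell\,\frac{2\nu-1}{2n}\,\pi\right)$$

$$= \frac{1}{2}\sum_{\nu=1}^{n}\left[\cos\left((k+\ell)\,\frac{2\nu-1}{2n}\,\pi\right) + \cos\left((k-\ell)\,\frac{2\nu-1}{2n}\,\pi\right)\right]$$

$$= \frac{1}{2}\,\mathrm{Re}\left\{e^{-i(k+\ell)\pi/(2n)}\sum_{\nu=1}^{n} e^{i\nu(k+\ell)\pi/n} + e^{-i(k-\ell)\pi/(2n)}\sum_{\nu=1}^{n} e^{i\nu(k-\ell)\pi/n}\right\}.$$

Assume $k \ne \ell$. Both sums in the last equation are finite geometric series and
can thus be summed explicitly. One gets

$$(T_k, T_\ell) = \frac{1}{2}\,\mathrm{Re}\left\{e^{i(k+\ell)\pi/(2n)}\,\frac{1 - e^{i(k+\ell)\pi}}{1 - e^{i(k+\ell)\pi/n}} + e^{i(k-\ell)\pi/(2n)}\,\frac{1 - e^{i(k-\ell)\pi}}{1 - e^{i(k-\ell)\pi/n}}\right\}$$

$$= \frac{1}{2}\,\mathrm{Re}\left\{\frac{i\left[1 - e^{i(k+\ell)\pi}\right]}{2\sin\frac{(k+\ell)\pi}{2n}} + \frac{i\left[1 - e^{i(k-\ell)\pi}\right]}{2\sin\frac{(k-\ell)\pi}{2n}}\right\},$$

where the denominators are not zero by the assumption on $k$ and $\ell$. Now if $k + \ell$
(and hence also $k - \ell$) is even, both expressions in brackets are zero. If $k + \ell$
(and hence also $k - \ell$) is odd, the numerators are both $2i$, hence the real part
equals zero. In either case, $(T_k, T_\ell) = 0$, as claimed.

An easy argument also shows that the value of the inner product is $\frac{n}{2}$ if
$k = \ell > 0$, and $n$ if $k = \ell = 0$.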

The result follows more easily from the continuous orthogonality 
(cf. Sect. 2.2.4, (2.99)) by applying the Gauss—Chebyshev quadrature formula 
(cf. Chap. 3, Sect. 3.2.3 and Ex. 36). 

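57. (a) For $n = 1$, the assertion of the Hint is true for all $r \ge 0$ since

$$[x_r, x_{r+1}]f = \frac{\log_{10} x_{r+1} - \log_{10} x_r}{x_{r+1} - x_r} = \frac{(r+1) - r}{10^{r+1} - 10^r} = \frac{1}{10^r(10 - 1)} = \frac{1}{9\cdot 10^r}.$$

Thus, assume the assertion to be true for some $n$ and all $r \ge 0$. Then, by
the property (2.113) of divided differences,

$$[x_r, x_{r+1}, \ldots, x_{r+n}, x_{r+n+1}]f = \frac{[x_{r+1}, x_{r+2}, \ldots, x_{r+n+1}]f - [x_r, x_{r+1}, \ldots, x_{r+n}]f}{x_{r+n+1} - x_r}$$

$$= \frac{(-1)^{n-1}}{10^{rn+n(n-1)/2}(10^n - 1)}\cdot\frac{1 - 10^n}{10^n(10^{r+n+1} - 10^r)}$$

$$= \frac{(-1)^{n}}{10^{rn+n(n-1)/2}\,10^{r+n}\,(10^{n+1} - 1)}$$

$$= \frac{(-1)^{n}}{10^{r(n+1)+n(n+1)/2}(10^{n+1} - 1)},$$

which is precisely the assumed assertion with $n$ replaced by $n + 1$.
(b) Let $a_k = [x_0, x_1, \ldots, x_k]f$. By Newton's formula, noting that $a_0 =
\log_{10} 1 = 0$, we have

$$p_n(x) = \sum_{k=1}^{n} a_k (x - 1)(x - 10)\cdots(x - 10^{k-1})$$

$$= \sum_{k=1}^{n} \frac{(-1)^{k-1}}{10^{k(k-1)/2}(10^k - 1)}\prod_{\ell=0}^{k-1}(x - 10^\ell)$$

$$= \sum_{k=1}^{n} \frac{(-1)^{k-1}}{10^k - 1}\prod_{\ell=0}^{k-1}\left(\frac{x}{10^\ell} - 1\right)$$

$$= -\sum_{k=1}^{n} \frac{1}{10^k - 1}\prod_{\ell=0}^{k-1}\left(1 - x/10^\ell\right)$$

$$= -\sum_{k=1}^{n} t_k(x),$$

where

$$t_k(x) = \frac{1}{10^k - 1}\prod_{\ell=0}^{k-1}\left(1 - x/10^\ell\right).$$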
For 1 < x < 10, we have 


1—x)(1—x/10*—!) 
10k —1 


[tk (x)| < ( < : 1 : < 
i 10° —1 10e—1 ) ~ Tor" 


Thus, the infinite series }°7° ; t; (x) is majorized by the convergent geomet- 
ric series 9 )-7°_, 10~* and therefore also converges. However, for x = 9, 
one computes 


$$-\sum_{k=1}^{\infty} t_k(9) = \sum_{k=1}^{\infty} \frac{8}{10^k - 1}\prod_{\ell=1}^{k-1}\left(1 - 9\cdot 10^{-\ell}\right) = 0.89777\ldots < \log_{10}(9) = 0.95424\ldots$$


(For an analysis of the discrepancy, see Gautschi [2008].) 
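
The limit value at $x = 9$ is easy to reproduce numerically. The sketch below sums the series $-\sum_k t_k(x)$ with $t_k(x) = \frac{1}{10^k-1}\prod_{\ell=0}^{k-1}(1 - x/10^\ell)$; the function names and the truncation after 30 terms are choices made here for illustration.

```python
import math


def t_k(k, x):
    """Term t_k(x) = prod_{l=0}^{k-1} (1 - x/10^l) / (10^k - 1)."""
    prod = 1.0
    for l in range(k):
        prod *= 1.0 - x / 10.0 ** l
    return prod / (10.0 ** k - 1.0)


def p_limit(x, terms=30):
    """Partial sum of -sum_{k>=1} t_k(x); terms decay like 10^(-k)."""
    return -sum(t_k(k, x) for k in range(1, terms + 1))


print(p_limit(9.0), math.log10(9.0))
```

At the nodes themselves the series terminates and reproduces $\log_{10}$, for example `p_limit(1.0)` is 0 and `p_limit(10.0)` is 1; the discrepancy appears only between nodes.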


75. Let $s(x) = \sum_{j=1}^{n} c_j B_j(x)$. Then, with points $\xi_i$ as defined, the first $n - 1$
conditions imposed on $s$ can be written as

$$c_1 B_1(\xi_1) + c_2 B_2(\xi_1) = f_1,$$
$$c_2 B_2(\xi_2) + c_3 B_3(\xi_2) = f_2,$$
$$\cdots$$
$$c_{n-1} B_{n-1}(\xi_{n-1}) + c_n B_n(\xi_{n-1}) = f_{n-1}.$$

The last condition imposed is, since $B_1(x_1) = B_n(x_n) = 1$,

$$c_1 - c_n = 0.$$

The matrix of the system has the following structure:

$$\begin{bmatrix}
\times & \times & & & \\
 & \times & \times & & \\
 & & \ddots & \ddots & \\
 & & & \times & \times \\
1 & & & & -1
\end{bmatrix}$$

Note that $B_i(\xi_i) \ne 0$ for $i = 1,2,\ldots,n-1$.

Solution: (1) Subtract a suitable multiple of the first equation from the last
equation to create a zero in position $(n, 1)$. This produces a fill-in in position
$(n, 2)$. (2) Subtract a suitable multiple of the second equation from the last to
create a zero in position $(n, 2)$. This produces a fill-in in position $(n, 3)$, etc.
After $n - 1$ such operations one obtains a nonsingular upper bidiagonal system,
which is quickly solved by back substitution.

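79. (a) From Sect. 2.2.4, (2.140) and (2.141), the spline on $[x_i, x_{i+1}]$ is

$$p_i(x) = c_{i,0} + c_{i,1}(x - x_i) + c_{i,2}(x - x_i)^2 + c_{i,3}(x - x_i)^3,$$

where

$$(*)\qquad c_{i,0} = f_i, \quad c_{i,1} = m_i, \quad c_{i,2} = \frac{[x_i, x_{i+1}]f - m_i}{\Delta x_i} - c_{i,3}\,\Delta x_i, \quad c_{i,3} = \frac{m_{i+1} + m_i - 2[x_i, x_{i+1}]f}{(\Delta x_i)^2}.$$

The two "not-a-knot" conditions are $p_1'''(x_2) = p_2'''(x_2)$, $p_{n-2}'''(x_{n-1}) =
p_{n-1}'''(x_{n-1})$. By $(*)$, this yields

$$c_{1,3} = c_{2,3}, \qquad c_{n-2,3} = c_{n-1,3}.$$

Substituting from $(*)$, the first equality becomes

$$\frac{m_2 + m_1 - 2[x_1, x_2]f}{(\Delta x_1)^2} = \frac{m_3 + m_2 - 2[x_2, x_3]f}{(\Delta x_2)^2},$$

or, after some elementary manipulations,

$$(1)\qquad m_1 + \left(1 - \left(\frac{\Delta x_1}{\Delta x_2}\right)^2\right)m_2 - \left(\frac{\Delta x_1}{\Delta x_2}\right)^2 m_3
= 2\left([x_1, x_2]f - \left(\frac{\Delta x_1}{\Delta x_2}\right)^2 [x_2, x_3]f\right) =: b_1.$$

Similarly, the second equality becomes

$$(n)\qquad m_{n-2} + \left(1 - \left(\frac{\Delta x_{n-2}}{\Delta x_{n-1}}\right)^2\right)m_{n-1} - \left(\frac{\Delta x_{n-2}}{\Delta x_{n-1}}\right)^2 m_n
= 2\left([x_{n-2}, x_{n-1}]f - \left(\frac{\Delta x_{n-2}}{\Delta x_{n-1}}\right)^2 [x_{n-1}, x_n]f\right) =: b_n.$$

(b) The first equation (for $i = 2$) from Sect. 2.2.4, (2.145), is

$$(2)\qquad \Delta x_2\, m_1 + 2(\Delta x_1 + \Delta x_2)\, m_2 + \Delta x_1\, m_3 = b_2.$$

Multiply (2) by $\frac{\Delta x_1}{(\Delta x_2)^2}$ and add to (1) to get the new pair of equations

$$\left(1 + \frac{\Delta x_1}{\Delta x_2}\right)m_1 + \left(1 + \frac{\Delta x_1}{\Delta x_2}\right)^2 m_2 = b_1 + \frac{\Delta x_1}{(\Delta x_2)^2}\,b_2,$$

$$\Delta x_2\, m_1 + 2(\Delta x_1 + \Delta x_2)\, m_2 + \Delta x_1\, m_3 = b_2.$$

This is the beginning of a tridiagonal system.

Similarly, the last equation (for $i = n - 1$) from Sect. 2.2.4, (2.145), is

$$(n-1)\qquad \Delta x_{n-1}\, m_{n-2} + 2(\Delta x_{n-2} + \Delta x_{n-1})\, m_{n-1} + \Delta x_{n-2}\, m_n = b_{n-1}.$$

Multiply Eq. $(n - 1)$ by $\frac{1}{\Delta x_{n-1}}$ and subtract from $(n)$; then the last two
equations become

$$\Delta x_{n-1}\, m_{n-2} + 2(\Delta x_{n-2} + \Delta x_{n-1})\, m_{n-1} + \Delta x_{n-2}\, m_n = b_{n-1},$$

$$-\left(1 + \frac{\Delta x_{n-2}}{\Delta x_{n-1}}\right)^2 m_{n-1} - \frac{\Delta x_{n-2}}{\Delta x_{n-1}}\left(1 + \frac{\Delta x_{n-2}}{\Delta x_{n-1}}\right)m_n = b_n - \frac{1}{\Delta x_{n-1}}\,b_{n-1}.$$

This is the end of the tridiagonal system.
(c) No: the system is not diagonally dominant, since in the first equation
the diagonal element $1 + \frac{\Delta x_1}{\Delta x_2}$ is less than the other remaining element
$\left(1 + \frac{\Delta x_1}{\Delta x_2}\right)^2$.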
4. (a), (b) and (c)

PROGRAMS

%MAII_4ABC
%
function [beta,gamma,mu,coeff,L2err,maxerr]=MAII_4ABC(N)
P=zeros(N+2,N+1); i=0:N; t=-1+2*i/N;
beta(1)=2; b=(1+1/N)^2; gamma(1)=2;
for k=1:N
  beta(k+1)=b*(1-(k/(N+1))^2)/(4-1/k^2);
  gamma(k+1)=beta(k+1)*gamma(k);
end
P(1,:)=1; P(2,:)=t; mu(1)=max(abs(t));
for k=2:N+1
  P(k+1,:)=t.*P(k,:)-beta(k)*P(k-1,:);
  mu(k)=max(abs(P(k+1,:)));
end
for n=0:N
  coeff(n+1)=2*sum(P(n+1,:).*f(t))/((N+1)*gamma(n+1));
end
for n=0:N
  emax=0; e2=0;
  for k=1:N+1
    e=abs(sum(coeff(1:n+1)'.*P(1:n+1,k))-f(t(k)));
    if e>emax, emax=e; end
    e2=e2+e^2;
  end
  L2err(n+1)=sqrt(2*e2/(N+1));
  maxerr(n+1)=emax;
end

function y=f(x)
y=exp(-x);
%y=log(2+x);
%y=sqrt(1+x);
%y=abs(x);

%RUNMAII_4ABC Driver program for
% MAII_4ABC
%
f0='%4.0f %20.15f %23.15e %23.15e\n';
f1='%12.0f %23.15e %12.4e %12.4e\n';
disp(['   k        beta(k)     ' ...
  '         gamma(k)       ' ...
  '         mu(k+1)'])
N=10;
[beta,gamma,mu,coeff,L2err,maxerr] ...
  =MAII_4ABC(N);
for k=1:N+1
  fprintf(f0,k-1,beta(k),gamma(k), ...
    mu(k))
end
fprintf('\n')
disp(['           n     coefficients' ...
  '        L2 error    max error'])
for k=1:N+1
  fprintf(f1,k-1,coeff(k),L2err(k), ...
    maxerr(k))
end

OUTPUT

>> runMAII_4ABC
beta (k) gamma (k) mu (k+1) 
.000000000000000 2.000000000000000e+00 .-000000000000000e+00 
-400000000000000 8.000000000000002e-01 -999999999999999e-01 
.312000000000000 2.496000000000001e-01 -879999999999998e-01 
-288000000000000 7.188480000000004e-02 .152000000000001e-01 
-266666666666667 1.916928000000001e-02 -680000000000001e-02 
~242424242424242 4.647098181818186e-03 -351272727272728e-02 
.213986013986014 9.944140165289268e-04 -488738461538463e-02 
.180923076923077 1.799124436058489e-04 -269231888111895e-03 
.143058823529412 2.573806252055439e-05 .834253163307282e-03 
.100309597523220 2.581774692464279e-06 .068331109138542e-04 
-052631578947368 1.358828785507516e-07 .771414197649772e-17 
coefficients L2 error max error 
-212203623058161e+00 1.0422e+00 1.5061e+00 f(t) =exp(-t) 
.123748299778268e+00 2.7551e-01 3.8233e-01 
-430255798492442e-01 4.7978e-02 5.6515e-02 
- 774967744700318e-01 6.0942e-03 5.7637e-03 
-380735440218084e-02 5.9354e-04 7.1705e-04 
-681309127655327e-03 4.5392e-05 5.0328e-05 
6 1.436822757642024e-03 2.7410e-06 3.1757e-06
7 -2.041262174764023e-04 1.2894e-07 1.3676e-07 
8 2.539983220653184e-05 4.5185e-09 5.2244e-09 
9 -2.811356175489033e-06 1.0329e-10 1.4201le-10 
10 2.800374538854939e-07 6.2576e-14 7.9492e-14 
n coefficients L2 error max error 
0) 6.379455015198038e-01 4.8331e-01 6.3795e-01 f(t) =1n(2+t) 
Hl 5 .341350646266596e-01 7.3165e-02 1.0381le-01 
2 -1.436962628313260e-01 1.4112e-02 1.7593e-02 
3 5.149163971020859e-02 2.9261e-03 3.2053e-03 
4 -2.066713722106771e-02 6.1199e-04 8.2444e-04 
5 8.790920250468457e-03 1.2409e-04 1.4929e-04 
6 -3.863610725685766e-03 2.3550e-05 3.0261e-05 
7 1.730112538790927e-03 4.0112e-06 4.5316e-06 
8 -7.825109066954439e-04 5.7410e-07 6.9079e-07 
9 3.553730415235034e-04 5.9485e-08 8.1789e-08 
10 -1.613718817644612e-04 2.1657e-14 2.7534e-14 
n coefficients L2 error max error 
0 9.134654065768736e-0 5.7547e-01 9.1347e-01 f(t) =sqrt (1+t) 
1 6.165636213969754e-0 1.6444e-01 2.9690e-01 
2 -2.799173478370132e-0 8.6512e-02 1.2895e-01 
3 2.654751232178156e-0 4.9173e-02 7.8888e-02 
4 -2.969055755002559e-0 2.6985e-02 4.4684e-02 
5 3.416320385824934e-0 1.3632e-02 1.8447e-02 
6 -3.862066935480817e-0 6.1238e-03 9.0935e-03 
wi 4.215059528049150e-0 2.3530e-03 3.0774e-03 
8 -4.411293520434504e-0 7.2661e-04 9.1223e-04 
9 4.416771232533437e-0 1.5591e-04 2.1437e-04 
10 -4.229517654415031e-0 2.7984e-14 3.5305e-14 
n coefficients L2 error max error 
0) 5 .454545454545454e-0 4.5272e-01 5.4545e-01 £(t)=|t| 
1 5 .046468293750710e-17 4.5272e-01 5.4545e-01 
2 8.741258741258736e-0 1.1933e-01 1.9580e-01 
3 0.000000000000000e+00 1,.1933e-01 1.9580e-01 
4 -7.284382284382317e-0 6.3786e-02 1.1189¢6-01 
5 -5.429698379835253e-16 6.3786e-02 1.1189e-01 
6 1.531862745098003e+00 4.1655e-02 6.9107e-02 
7 -3.506197370654419e-15 4.1655e-02 6.9107e-02 
8 -6.118812656642364e+00 2.7776e-02 3.8191e-02 
9 2.061545898679865e-13 2.7776e-02 3.8191e-02 
10 7.535204475298005e+01 3.9494e-14 5.0709e-14 
>> 
Comments

• Note that the last entry in the mu column vanishes, confirming that $\pi_{N+1}$
vanishes at all the $N + 1$ nodes $t_\ell$ (cf. Ex. 22(b)).

• The calculation of $\hat{c}_n$ is subject to severe cancellation errors as $n$
increases. Indeed, from the formula for the coefficient $\hat{c}_n$ (cf. (2.24)),

$$\hat{c}_n = 2\sum_{\ell=0}^{N} f(t_\ell)\pi_n(t_\ell)\,/\,\big((N + 1)\gamma_n\big),$$

one expects $\gamma_n \cdot \hat{c}_n$, being a "mean value" of the quantities $\pi_n(t_\ell)$, to
have the order of magnitude of these quantities, i.e., of $\mu_n$, unless there
is considerable cancellation in the summation, in which case $\gamma_n \cdot \hat{c}_n$ is
much smaller in absolute value than $\mu_n$. That, in fact, is clearly observed
in our output when $n$ gets large.

• The maximum error for $n = N$ should be zero since the least squares approximant
of degree $N$ interpolates. This is confirmed reasonably well in the output.

• $e^{-t}$: Note the rapid convergence. This is because the exponential
function is an entire function, hence very smooth.

• $\ln(2 + t)$: Remarkably good convergence in spite of the logarithmic
singularity at $t = -2$, a distance of 1 from the left endpoint of $[-1,1]$.

• $\sqrt{1 + t}$: Slow convergence because of $f'(t) \to \infty$ as $t \to -1$. There
is a branch-point singularity at $t = -1$.

• $|t|$: Extremely slow convergence since $f$ is not differentiable at $t = 0$.
Since $f$ is even, the approximation for $n$ odd is exactly the same as the
one for the preceding even $n$. This is evident from the $L_2$ and maximum
errors and from the vanishing of the odd-numbered coefficients.


7. (a) PROGRAM

%MAII_7AB
%
hold on
it=(0:100)'; t=-5+it/10;
y=1./(1+t.^2);
plot(t,y,'k*')
axis([-5.5 5.5 -1.2 1.2])
for n=2:2:8
  i=(0:n)'; it=(0:100)';
  x=-5+10*i/n; t=-5+it/10;
  y=pnewt(n,x,t);
  plot(t,y)
end
hold off

%PNEWT
%
function y=pnewt(n,x,t)
d=zeros(n+1,1);
d=f(x);
if n==0
  y=d(1);
  return
end
for j=1:n
  for i=n:-1:j
    d(i+1)=(d(i+1)-d(i))/(x(i+1)-x(i+1-j));
  end
end
y=d(n+1);
for i=n:-1:1
  y=d(i)+(t-x(i)).*y;
end

function y=f(x)
y=1./(1+x.^2);

OUTPUT 


The interpolation polynomials are drawn as solid lines, the exact function 
as black stars. 


The “Runge phenomenon”, i.e., the violent oscillations of the interpolants 
near the end points, is clearly evident. 
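
For readers without Matlab, the same in-place, bottom-up generation of divided differences can be sketched in Python. This is an illustrative translation under the assumption of 0-based indexing, not the author's program; the names `newton_coefficients`, `newton_eval`, and `max_error` are introduced here.

```python
def newton_coefficients(xs, fs):
    """Divided differences f[x0], f[x0,x1], ..., generated in place, bottom up."""
    d = list(fs)
    n = len(xs) - 1
    for j in range(1, n + 1):
        for i in range(n, j - 1, -1):       # bottom up, so d[i-1] is still "old"
            d[i] = (d[i] - d[i - 1]) / (xs[i] - xs[i - j])
    return d


def newton_eval(xs, d, t):
    """Evaluate Newton's form by nested multiplication."""
    y = d[-1]
    for i in range(len(xs) - 2, -1, -1):
        y = d[i] + (t - xs[i]) * y
    return y


def runge(t):
    return 1.0 / (1.0 + t * t)


def max_error(n, samples=1001):
    """Maximum of |p_n - f| on an equally spaced grid of [-5, 5]."""
    xs = [-5.0 + 10.0 * i / n for i in range(n + 1)]
    d = newton_coefficients(xs, [runge(x) for x in xs])
    ts = [-5.0 + 10.0 * k / (samples - 1) for k in range(samples)]
    return max(abs(newton_eval(xs, d, t) - runge(t)) for t in ts)


for n in (2, 4, 6, 8):
    print(n, max_error(n))
```

The printed maxima make the Runge phenomenon quantitative: the error for $n = 8$ is larger than that for $n = 4$, so adding equally spaced nodes does not help here.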


8. (a) PROGRAM

%TRIDIAG
%
% Gauss elimination without pivoting for a nxn (not
% necessarily symmetric) tridiagonal system with nonzero
% diagonal elements a, subdiagonal elements b, superdiagonal
% elements c, and right-hand vector v. The solution vector
% is y. The vectors a and v will undergo changes by the
% routine.
%
function y=tridiag(n,a,b,c,v)
y=zeros(n,1);
for i=2:n
  r=b(i-1)/a(i-1);
  a(i)=a(i)-r*c(i-1);
  v(i)=v(i)-r*v(i-1);
end
y(n)=v(n)/a(n);
for i=n-1:-1:1
  y(i)=(v(i)-c(i)*y(i+1))/a(i);
end
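
The elimination is short enough to restate in Python for checking by hand. The sketch below mirrors the Matlab routine with 0-based indices; unlike the original it works on copies of `a` and `v`, which is a design choice made here so that the caller's data are left unchanged.

```python
def tridiag(a, b, c, v):
    """Solve a tridiagonal system by Gauss elimination without pivoting.

    a: diagonal (length n), b: subdiagonal (length n-1),
    c: superdiagonal (length n-1), v: right-hand side (length n).
    """
    n = len(a)
    a = list(a)          # copies: the elimination overwrites a and v
    v = list(v)
    for i in range(1, n):
        r = b[i - 1] / a[i - 1]
        a[i] -= r * c[i - 1]
        v[i] -= r * v[i - 1]
    y = [0.0] * n
    y[n - 1] = v[n - 1] / a[n - 1]
    for i in range(n - 2, -1, -1):
        y[i] = (v[i] - c[i] * y[i + 1]) / a[i]
    return y


# 3x3 example with diagonal 2 and off-diagonals 1; the solution is [1, 2, 3]
print(tridiag([2.0, 2.0, 2.0], [1.0, 1.0], [1.0, 1.0], [4.0, 8.0, 8.0]))
```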


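(b) The natural spline on the interval $[x_i, x_{i+1}]$ is (cf. (2.140), (2.141))

$$s_{\mathrm{nat}}(x) = c_{i,0} + c_{i,1}(x - x_i) + c_{i,2}(x - x_i)^2 + c_{i,3}(x - x_i)^3, \quad x_i \le x \le x_{i+1},$$

where

$$c_{i,0} = f_i, \quad c_{i,1} = m_i, \quad c_{i,2} = \frac{[x_i, x_{i+1}]f - m_i}{\Delta x_i} - c_{i,3}\,\Delta x_i, \quad c_{i,3} = \frac{m_{i+1} + m_i - 2[x_i, x_{i+1}]f}{(\Delta x_i)^2},$$

and the vector $m = [m_1, m_2, \ldots, m_n]^T$ satisfies the tridiagonal system of
equations (cf. Sect. 2.2.4, (b.3))

$$\begin{aligned}
2m_1 + m_2 &= b_1\\
(\Delta x_2)m_1 + 2(\Delta x_1 + \Delta x_2)m_2 + (\Delta x_1)m_3 &= b_2\\
&\ \,\vdots\\
(\Delta x_{n-1})m_{n-2} + 2(\Delta x_{n-2} + \Delta x_{n-1})m_{n-1} + (\Delta x_{n-2})m_n &= b_{n-1}\\
m_{n-1} + 2m_n &= b_n
\end{aligned}$$

where

$$\begin{aligned}
b_1 &= 3[x_1, x_2]f\\
b_2 &= 3\{(\Delta x_2)[x_1, x_2]f + (\Delta x_1)[x_2, x_3]f\}\\
&\ \,\vdots\\
b_{n-1} &= 3\{(\Delta x_{n-1})[x_{n-2}, x_{n-1}]f + (\Delta x_{n-2})[x_{n-1}, x_n]f\}\\
b_n &= 3[x_{n-1}, x_n]f
\end{aligned}$$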
PROGRAM (for natural spline)

%MAII_8B
%
f0='%8.0f %12.4e\n';
n=11; N=51;
a=zeros(n,1); b=zeros(n-1,1); c=b;
i=(1:n)'; j=(1:N)';
x=(i-1)/(n-1);
%x=((i-1)/(n-1)).^2;
f=exp(-x);
%f=sqrt(x).^5;
dx=x(2:n)-x(1:n-1); df=(f(2:n)-f(1:n-1))./dx;
a(1)=2; a(n)=2; b(n-1)=1; c(1)=1;
v(1)=3*df(1); v(n)=3*df(n-1);
a(2:n-1)=2*(dx(1:n-2)+dx(2:n-1));
b(1:n-2)=dx(2:n-1); c(2:n-1)=dx(1:n-2);
v(2:n-1)=3*(dx(2:n-1).*df(1:n-2)+dx(1:n-2).*df(2:n-1));
m=tridiag(n,a,b,c,v);
c0=f(1:n-1); c1=m(1:n-1);
c3=(m(2:n)+m(1:n-1)-2*df)./(dx.^2);
c2=(df-m(1:n-1))./dx-c3.*dx;
emax=zeros(n-1,1);
for i=1:n-1
  xx=x(i)+((j-1)/(N-1))*dx(i);
  t=xx-x(i);
  s=c3(i);
  s=t.*s+c2(i);
  s=t.*s+c1(i);
  s=t.*s+c0(i);
  emax(i)=max(abs(s-exp(-xx)));
% emax(i)=max(abs(s-sqrt(xx).^5));
  fprintf(f0,i,emax(i))
end


(c) For the complete spline, only two small changes need to be made, as
indicated by comment lines in the program below.

PROGRAM (for complete spline)

%MAII_8C
%
f0='%8.0f %12.4e\n';
n=11; N=51;
a=zeros(n,1); b=zeros(n-1,1); c=b;
i=(1:n)'; j=(1:N)';
x=(i-1)/(n-1);
% x=((i-1)/(n-1)).^2;
f=exp(-x);
%f=sqrt(x).^5;
%
% The next statement is new and does not occur in the
% program of (b)
%
fder_1=-1; fder_n=-exp(-1);
%fder_1=0; fder_n=5/2;
dx=x(2:n)-x(1:n-1); df=(f(2:n)-f(1:n-1))./dx;
%
% The next two lines differ from the corresponding lines
% in the program of (b)
%
a(1)=1; a(n)=1; b(n-1)=0; c(1)=0;
v(1)=fder_1; v(n)=fder_n;
a(2:n-1)=2*(dx(1:n-2)+dx(2:n-1));
b(1:n-2)=dx(2:n-1); c(2:n-1)=dx(1:n-2);
v(2:n-1)=3*(dx(2:n-1).*df(1:n-2)+dx(1:n-2).*df(2:n-1));
m=tridiag(n,a,b,c,v);
c0=f(1:n-1); c1=m(1:n-1);
c3=(m(2:n)+m(1:n-1)-2*df)./(dx.^2);
c2=(df-m(1:n-1))./dx-c3.*dx;
emax=zeros(n-1,1);
for i=1:n-1
  xx=x(i)+((j-1)/(N-1))*dx(i);
  t=xx-x(i);
  s=c3(i);
  s=t.*s+c2(i);
  s=t.*s+c1(i);
  s=t.*s+c0(i);
  emax(i)=max(abs(s-exp(-xx)));
% emax(i)=max(abs(s-sqrt(xx).^5));
  fprintf(f0,i,emax(i))
end

(d) OUTPUT

>> MAII_8B                             >> MAII_8C
 1   4.9030e-04   f(x)=exp(-x)      2.5589e-07   f(x)=exp(-x)
 2   1.3163e-04   natural           2.2123e-07   complete
 3   3.5026e-05   spline            2.0294e-07   spline
 4   9.5467e-06   uniform           1.8288e-07   uniform
 5   2.2047e-06   partition         1.6568e-07   partition
 6   4.2094e-07                     1.4984e-07
 7   3.4559e-06                     1.3568e-07
 8   1.2809e-05                     1.2247e-07
 9   4.8441e-05                     1.1190e-07
10   1.8036e-04                     9.7227e-08
>>                                  >>

>> MAII_8B >> MAII_8C
1 2.0524e-04 f(x)=x^(5/2) 4.4346e-05 f(x)=x^(5/2)
2 5.3392e-05 natural 9.9999e-06 complete 
3 .6192e-05 spline 4.7121le-06 spline 
4 2.7607e-06 uniform 3.9073e-07 uniform 
5 -2880e-06 partition 9.8538e-07 partition 
6 9.8059e-06 5.2629e-07 
7 3.4951e-05 4.7131e-07 
8 -3252e-04 3.6500e-07 
9 4.9310e-04 3.1193e-07 
10 1.8416e-03 2.4850e-07 
ae cr 
>> MAII_ 8B >> MAII_ 8c 
1 6.6901e-07 £(x)=x* (5/2) 1.0809e-07 £(x)=x* (5/2) 
2 2.3550e-07 natural 6.0552e-07 complete 
ec} 1.1749e-06 spline 9.1261e-07 spline 
4 1.6950e-06 nonuniform 1.3558e-06 nonuniform 
5 1.5853e-06 partition 1.7319e-06 partition 
6 1.6441e-05 2.1329e-06 
7 6.8027e-05 2.5242e-06 
8 3.2950e-04 2.9138e-06 
9 1.4755e-03 3.3225e-06 
10 6.5448e-03 3.6393e-06 
>> >> 
Comments 


Testing: For the natural spline, the error should be exactly zero if $f$ is
any linear function. (Not for arbitrary cubics, since $f''$ does not vanish at
$x = 0$ and $x = 1$, unless $f$ is linear.) For the complete spline, the error
is zero for any cubic, if one sets $m_1 = f'(0)$ and $m_n = f'(1)$. Example:
$f(x) = x^3$, $m_1 = 0$, $m_n = 3$.

The natural spline for the uniform partition is relatively inaccurate near
the endpoints, as expected.

The complete spline is uniformly accurate for $f(x) = e^{-x}$ but still
relatively inaccurate near $x = 0$ for $f(x) = x^{5/2}$ on account of the
"square root" singularity (of $f'''$) at $x = 0$.

Nonuniform partition (for $f(x) = x^{5/2}$): The natural spline is accurate
near $x = 0$ because of the nodes being more dense there, but is still
inaccurate at the other end. The complete spline is remarkably accurate at
both ends, as well as elsewhere.


Chapter 3 
Numerical Differentiation and Integration 


Differentiation and integration are infinitary concepts of calculus; that is, they are 
defined by means of a limit process — the limit of the difference quotient in the first 
instance, the limit of Riemann sums in the second. Since limit processes cannot be 
carried out on the computer, we must replace them by finite processes. The tools to 
do so come from the theory of polynomial interpolation (Chap. 2, Sect. 2.2). They 
not only provide us with approximate formulae for the limits in question, but also 
permit us to estimate the errors committed and discuss convergence. 


3.1 Numerical Differentiation 


For simplicity, we consider only the first derivative; analogous techniques apply to 
higher-order derivatives. 

The problem can be formulated as follows: for a given differentiable function $f$,
approximate the derivative $f'(x_0)$ in terms of the values of $f$ at $x_0$ and at nearby
points $x_1, x_2, \ldots, x_n$ (not necessarily equally spaced or in natural order). Estimate
the error of the approximation obtained.

In Sect. 3.1.1, we solve this problem by means of interpolation. Examples are 
given in Sect. 3.1.2, and the problematic nature of numerical differentiation in the 
presence of rounding errors is briefly discussed in Sect. 3.1.3. 


3.1.1 A General Differentiation Formula for Unequally 
Spaced Points 


The idea is simply to differentiate not $f(\cdot)$, but its interpolation polynomial
$p_n(f; x_0, \ldots, x_n; \cdot)$. By carrying along the error term of interpolation, we can
analyze the error committed.
Thus, recall from Chap. 2, Sect. 2.2, that, given the $n + 1$ distinct points $x_0$,
$x_1, \ldots, x_n$, we have

$$f(x) = p_n(f; x) + r_n(x), \tag{3.1}$$

where the interpolation polynomial can be written in Newton's form

$$p_n(f; x) = f_0 + (x - x_0)[x_0, x_1]f + (x - x_0)(x - x_1)[x_0, x_1, x_2]f + \cdots
+ (x - x_0)(x - x_1)\cdots(x - x_{n-1})[x_0, x_1, \ldots, x_n]f \tag{3.2}$$

and the error term in the form

$$r_n(x) = (x - x_0)(x - x_1)\cdots(x - x_n)\,\frac{f^{(n+1)}(\xi(x))}{(n+1)!}, \tag{3.3}$$

assuming (as we do) that $f$ has a continuous $(n + 1)$st derivative in an interval that
contains all $x_i$ and $x$. Differentiating (3.2) with respect to $x$ and then putting $x = x_0$
gives

$$p_n'(f; x_0) = [x_0, x_1]f + (x_0 - x_1)[x_0, x_1, x_2]f + \cdots
+ (x_0 - x_1)(x_0 - x_2)\cdots(x_0 - x_{n-1})[x_0, x_1, \ldots, x_n]f. \tag{3.4}$$

Similarly, from (3.3) (assuming that $f$ has in fact $n + 2$ continuous derivatives in
an appropriate interval), we get

$$r_n'(x_0) = (x_0 - x_1)(x_0 - x_2)\cdots(x_0 - x_n)\,\frac{f^{(n+1)}(\xi(x_0))}{(n+1)!}. \tag{3.5}$$

Therefore, differentiating (3.1), we find

$$f'(x_0) = p_n'(f; x_0) + e_n, \tag{3.6}$$

where the first term on the right, given by (3.4), represents the desired approximation,
and the second,

$$e_n = r_n'(x_0), \tag{3.7}$$

given by (3.5), the respective error. If $H = \max_i |x_0 - x_i|$, we clearly obtain from
(3.5) that

$$e_n = O(H^n) \quad\text{as } H \to 0. \tag{3.8}$$

We can thus get approximation formulae of arbitrarily high order, but those with
large $n$ are of limited practical use; cf. Sect. 3.1.3.
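
Formula (3.4) translates directly into a short routine: form the divided differences of the data and accumulate them with the running products $(x_0 - x_1)\cdots(x_0 - x_{k-1})$. The following Python sketch illustrates this under the assumption that the first entry of `xs` is the point $x_0$ of differentiation; the function names are introduced here and are not from the text.

```python
import math


def divided_differences(xs, fs):
    """Return [f[x0], f[x0,x1], ..., f[x0,...,xn]], computed in place."""
    d = list(fs)
    n = len(xs) - 1
    for j in range(1, n + 1):
        for i in range(n, j - 1, -1):
            d[i] = (d[i] - d[i - 1]) / (xs[i] - xs[i - j])
    return d


def derivative_at_x0(xs, fs):
    """Approximate f'(xs[0]) by the right-hand side of (3.4)."""
    d = divided_differences(xs, fs)
    approx = 0.0
    prod = 1.0                      # empty product for the term with [x0,x1]f
    for k in range(1, len(xs)):
        approx += prod * d[k]
        prod *= xs[0] - xs[k]       # now (x0-x1)...(x0-xk) for the next term
    return approx


xs = [1.0, 1.3, 0.8, 1.1]           # x0 = 1, nearby points in no particular order
approx = derivative_at_x0(xs, [math.exp(x) for x in xs])
print(approx, abs(approx - math.exp(1.0)))
```

For data taken from a polynomial of degree at most $n$ the error term (3.5) vanishes, so the routine returns the exact derivative up to rounding.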
3.1.2 Examples


The most important uses of differentiation formulae are made in the discretization 
of differential equations — ordinary or partial. In these applications, the spacing of 
the points is usually uniform, but unequally distributed points arise when partial 
differential operators are to be discretized near the boundary of the domain of 
interest. 


1. $n = 1$, $x_1 = x_0 + h$. Here,

$$p_1'(f; x_0) = [x_0, x_1]f = \frac{f_1 - f_0}{h},$$

and (3.6) in conjunction with (3.7) and (3.5) gives

$$f'(x_0) = \frac{f_1 - f_0}{h} + e_1, \quad e_1 = -\frac{h}{2}\,f''(\xi), \tag{3.9}$$

provided $f \in C^3[x_0, x_1]$. (Taylor's formula, actually, shows that $f \in C^2[x_0, x_1]$
suffices.) Thus, the error is of $O(h)$ as $h \to 0$.

2. $n = 2$, $x_1 = x_0 + h$, $x_2 = x_0 - h$. We also use the suggestive notation $x_2 = x_{-1}$,
$f_2 = f_{-1}$. Here,

$$p_2'(f; x_0) = [x_0, x_1]f + (x_0 - x_1)[x_0, x_1, x_2]f. \tag{3.10}$$

The table of divided differences is:

$$\begin{array}{cccc}
x_{-1} & f_{-1} & & \\
x_0 & f_0 & \dfrac{f_0 - f_{-1}}{h} & \\
x_1 & f_1 & \dfrac{f_1 - f_0}{h} & \dfrac{f_1 - 2f_0 + f_{-1}}{2h^2}
\end{array}$$

Therefore,

$$p_2'(f; x_0) = \frac{f_1 - f_0}{h} - h\,\frac{f_1 - 2f_0 + f_{-1}}{2h^2} = \frac{f_1 - f_{-1}}{2h},$$

and (3.6), (3.7), and (3.5) give, if $f \in C^3[x_{-1}, x_1]$,

$$f'(x_0) = \frac{f_1 - f_{-1}}{2h} + e_2, \quad e_2 = -\frac{h^2}{6}\,f'''(\xi). \tag{3.11}$$

Both approximations (3.9) and (3.11) are difference quotients; the former,
however, is "one-sided" whereas the latter is "symmetric." As can be seen, the
symmetric difference quotient is one order more accurate than the one-sided
difference quotient.
Fig. 3.1 Partial derivative near the boundary

3. $n = 2$, $x_1 = x_0 + h$, $x_2 = x_0 + 2h$. In this case, we have the following table of
divided differences,

$$\begin{array}{cccc}
x_0 & f_0 & & \\
x_1 & f_1 & \dfrac{f_1 - f_0}{h} & \\
x_2 & f_2 & \dfrac{f_2 - f_1}{h} & \dfrac{f_2 - 2f_1 + f_0}{2h^2}
\end{array}$$

and (3.10) now gives

$$p_2'(f; x_0) = \frac{f_1 - f_0}{h} - h\,\frac{f_2 - 2f_1 + f_0}{2h^2} = \frac{-f_2 + 4f_1 - 3f_0}{2h},$$

hence, by (3.7) and (3.5),

$$f'(x_0) = \frac{-f_2 + 4f_1 - 3f_0}{2h} + e_2, \quad e_2 = \frac{h^2}{3}\,f'''(\xi). \tag{3.12}$$

Compared to (3.11), this formula also is accurate to $O(h^2)$, but the error is now
about twice as large, in modulus, as before. One always pays for destroying
symmetry!
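4. For a function $u = u(x, y)$ of two variables, approximate $\partial u/\partial x$ "near the
boundary."

Consider the points $P_0(x_0, y_0)$, $P_1(x_0 + h, y_0)$, $P_\beta(x_0 - \beta h, y_0)$, $0 < \beta <
1$ (see Fig. 3.1); the problem is to approximate $(\partial u/\partial x)(P_0)$ in terms of $u_0 =
u(P_0)$, $u_1 = u(P_1)$, $u_\beta = u(P_\beta)$.

The relevant table of divided differences is:

$$\begin{array}{cccc}
x_\beta & u_\beta & & \\
x_0 & u_0 & \dfrac{u_0 - u_\beta}{\beta h} & \\
x_1 & u_1 & \dfrac{u_1 - u_0}{h} & \dfrac{\beta(u_1 - u_0) - (u_0 - u_\beta)}{\beta h(1 + \beta)h}
\end{array}$$

Thus,

$$p_2'(u; P_0) = \frac{u_1 - u_0}{h} - h\,\frac{\beta(u_1 - u_0) - (u_0 - u_\beta)}{\beta h(1 + \beta)h} = \frac{\beta^2(u_1 - u_0) + (u_0 - u_\beta)}{\beta(1 + \beta)h},$$

and the error is given by

$$e_2 = -\frac{\beta}{6}\,h^2\,\frac{\partial^3 u}{\partial x^3}(\xi, y_0).$$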

3.1.3 Numerical Differentiation with Perturbed Data 


Formulae for numerical differentiation become more accurate as the spacing h 
between evaluation points is made smaller, provided the function to be differentiated 
is sufficiently smooth. This, however, is true only in theory, since in practice the 
data are usually inaccurate, if for no other reason than rounding, and the problem 
of cancellation becomes more acute as h gets smaller. There will be a point of 
diminishing returns, beyond which the errors increase rather than decrease. 

To give a simple analysis of this, take the symmetric differentiation formula 
(3.11), 


= m 
f'(x0) = re +e, 2 = — (3.13) 
Suppose now that what are known are not the exact values f4; = f(xo + A), but 
slight perturbations of them, say, 
FHfpnte: cy =Mtew. lesi| Hs. (3.14) 
Then our formula (3.13) becomes 
* * _ 
f' (x0) = ‘ies Be aah +e. (3.15) 


2h 2h 


Here, the first term on the right is what we actually compute (assuming, for 
simplicity, that 4 is machine-representable and roundoff errors in forming the 
difference quotient are neglected). The corresponding error, therefore, is 


= f/x - 2 a =-4 i +e 


and can be estimated by 


M 
[Eo < +2, My = max |” (3.16) 


1 [x—1.x1] 
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noise 


™~ truncation 


=h 
Fig. 3.2. Truncation and noise error in numerical differentiation 
The bound on the right is best possible. It consists of two parts, the term $\varepsilon/h$, which is due to noise in the data, and $\frac{1}{6} M_3 h^2$, which is the truncation error introduced by replacing the derivative by a finite difference expression. Their behavior is shown in Fig. 3.2.

If we denote the bound in (3.16) by $E(h)$,

$$
E(h) = \frac{\varepsilon}{h} + \frac{M_3}{6} h^2, \tag{3.17}
$$

then by determining its minimum, one finds

$$
E(h) \ge E(h_0), \qquad h_0 = \left(\frac{3\varepsilon}{M_3}\right)^{1/3},
$$

and

$$
E(h_0) = \frac{3}{2} \left(\frac{M_3}{3}\right)^{1/3} \varepsilon^{2/3}. \tag{3.18}
$$


This shows that even in the best of circumstances, the error is $O(\varepsilon^{2/3})$, not $O(\varepsilon)$, as one would hope. This represents a significant loss of accuracy.
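The trade-off described by (3.16) to (3.18) can be observed numerically. The following Python sketch is an illustration only: the test function $\sin$, the point $x_0 = 1$, the noise level `eps = 1e-8`, the bound `m3 = 1` on $|f'''|$, and all function names are assumptions made for this example and do not come from the text.

```python
import math
import random

def central_difference(f, x0, h):
    """Symmetric difference quotient (f(x0+h) - f(x0-h)) / (2h), cf. (3.13)."""
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

def error_bound(h, eps, m3):
    """The bound E(h) = eps/h + (M3/6) h^2 of (3.17)."""
    return eps / h + m3 * h * h / 6.0

def optimal_step(eps, m3):
    """The minimizer h0 = (3 eps / M3)^(1/3) of E(h)."""
    return (3.0 * eps / m3) ** (1.0 / 3.0)

eps = 1e-8            # assumed noise level in the data
m3 = 1.0              # max |f'''| for f = sin
rng = random.Random(0)

def noisy_sin(x):
    # sin(x) perturbed by noise of magnitude at most eps
    return math.sin(x) + eps * rng.uniform(-1.0, 1.0)

x0 = 1.0
h0 = optimal_step(eps, m3)
for h in (1e-1, 1e-2, h0, 1e-4, 1e-6):
    err = abs(math.cos(x0) - central_difference(noisy_sin, x0, h))
    print(f"h = {h:.3e}   error = {err:.3e}   bound E(h) = {error_bound(h, eps, m3):.3e}")
```

In such a run the observed error first decreases with $h$ and then grows again once $h$ drops below $h_0$, while staying below the bound $E(h)$ throughout.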

The same problem persists, indeed is more severe, in higher-order formulae. 
The only way one can escape from this dilemma is to use not difference formulae, 
but summation formulae, that is, integration. But to do this, one has to go into the 
complex plane and assume that the definition of f can be extended into a domain 
of the complex plane containing $x_0$. Then one can use Cauchy’s theorem,

$$
f'(x_0) = \frac{1}{2\pi i} \oint_C \frac{f(z)}{(z - x_0)^2}\, dz = \frac{1}{2\pi r} \int_0^{2\pi} e^{-i\theta} f(x_0 + r e^{i\theta})\, d\theta, \tag{3.19}
$$

in combination with numerical integration (cf. Sect. 3.2). Here, $C$ was taken to be a circular contour about $x_0$ with radius $r$, with $r$ chosen such that $z = x_0 + re^{i\theta}$ remains in the domain of analyticity of $f$. Since the result is real, one can replace the integrand by its real part.
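As a rough illustration of (3.19), the sketch below discretizes the $\theta$-integral with the $n$-point trapezoidal rule (which, for a periodic integrand, is simply an equally weighted sum). The function name `derivative_by_cauchy`, the default radius `r = 1`, and the choice `n = 32` are assumptions for this example; the caller must ensure that the circle of radius `r` lies inside the domain of analyticity of `f`.

```python
import cmath
import math

def derivative_by_cauchy(f, x0, r=1.0, n=32):
    """Approximate f'(x0) by the trapezoidal rule applied to (3.19).

    f must accept complex arguments and be analytic on and inside the
    circle |z - x0| = r (an assumption the caller has to guarantee).
    """
    total = 0.0
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        z = x0 + r * cmath.exp(1j * theta)
        # Only the real part of the integrand is needed, since f'(x0) is real.
        total += (cmath.exp(-1j * theta) * f(z)).real
    # (1 / (2 pi r)) * (2 pi / n) * sum = sum / (n r)
    return total / (n * r)

approx = derivative_by_cauchy(cmath.exp, 1.0)
print(approx, abs(approx - math.e))
```

No differences of nearly equal function values are formed here, which is why this approach avoids the cancellation problem of difference formulae.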
3.2 Numerical Integration


The basic problem is to calculate the definite integral of a given function $f$,
extended over a finite interval $[a,b]$. If $f$ is well behaved, this is a routine problem
for which the simplest integration rules, such as the composite trapezoidal or
Simpson’s rule (Sect. 3.2.1) will be quite adequate, the former having an edge over
the latter if $f$ is periodic with period $b - a$. Complications arise if $f$ has an integrable
singularity, or the interval of integration extends to infinity (which is just another 
manifestation of singular behavior). By breaking up the integral, if necessary, into 
several pieces, it can be assumed that the singularity, if its location is known, is at 
one (or both) ends of the interval [a, b]. Such “improper” integrals can usually be 
treated by weighted quadrature; that is, one incorporates the singularity into a weight 
function, which then becomes one factor of the integrand, leaving the other factor 
well behaved. The most important example of this is Gaussian quadrature relative 
to such a weight function (Sects. 3.2.2—3.2.4). Finally, it is possible to accelerate the 
convergence of quadrature schemes by suitable recombinations. The best-known 
example of this is Romberg integration (Sect. 3.2.7). 


3.2.1 The Composite Trapezoidal and Simpson’s Rules 


These may be regarded as the workhorses of numerical integration. They will do the 
job when the interval is finite and the integrand unproblematic. The trapezoidal rule 
is sometimes surprisingly effective even on infinite intervals. 

Both rules are obtained by applying the simplest kind of interpolation on subintervals of the decomposition

$$
a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b, \qquad x_k = a + kh, \quad h = \frac{b-a}{n}, \tag{3.20}
$$

of the interval $[a,b]$. In the trapezoidal rule, one interpolates linearly on each subinterval $[x_k, x_{k+1}]$, and obtains

$$
\int_{x_k}^{x_{k+1}} f(x)\,dx = \int_{x_k}^{x_{k+1}} p_1(f;x)\,dx + \int_{x_k}^{x_{k+1}} R_1(x)\,dx, \tag{3.21}
$$

where

$$
p_1(f;x) = f_k + (x - x_k)\, f[x_k, x_{k+1}], \qquad R_1(x) = (x - x_k)(x - x_{k+1})\, \frac{f''(\xi(x))}{2}, \qquad x_k \le x \le x_{k+1}. \tag{3.22}
$$

Here, $f_k = f(x_k)$, and we assumed that $f \in C^2[a,b]$. The first integral on the right of (3.21) is easily obtained as the area of a trapezoid with “bases” $f_k$, $f_{k+1}$ and “height” $h$, or else by direct integration of $p_1(f;\,\cdot\,)$ in (3.22). To the second integral, we can apply the Mean Value Theorem of integration, since $(x - x_k)(x - x_{k+1})$ has constant (negative) sign on $[x_k, x_{k+1}]$. The result is

$$
\int_{x_k}^{x_{k+1}} f(x)\,dx = \frac{h}{2}\,(f_k + f_{k+1}) - \frac{1}{12}\, h^3 f''(\xi_k), \tag{3.23}
$$

where $x_k < \xi_k < x_{k+1}$. This is the elementary trapezoidal rule. Summing over all subintervals gives the composite trapezoidal rule

$$
\int_a^b f(x)\,dx = h \left( \tfrac{1}{2} f_0 + f_1 + \cdots + f_{n-1} + \tfrac{1}{2} f_n \right) + E_n^T(f), \tag{3.24}
$$

with error term

$$
E_n^T(f) = -\frac{1}{12}\, h^3 \sum_{k=0}^{n-1} f''(\xi_k).
$$


This is not a particularly elegant expression for the error. We can simplify it by writing

$$
E_n^T(f) = -\frac{1}{12}\, h^2 (b - a) \left[ \frac{1}{n} \sum_{k=0}^{n-1} f''(\xi_k) \right]
$$

and noting that the expression in brackets is a mean value of second-derivative values, hence certainly contained between the algebraically smallest and largest value of the second derivative $f''$ on $[a,b]$. Since the function $f''$ was assumed continuous on $[a,b]$, it takes on every value between its smallest and largest, in particular, the bracketed value in question, at some interior point, say, $\xi$, of $[a,b]$. Consequently,

$$
E_n^T(f) = -\frac{1}{12}\,(b - a)\, h^2 f''(\xi), \qquad a < \xi < b. \tag{3.25}
$$

Since $f''$ is bounded in absolute value on $[a,b]$, this shows that $E_n^T(f) = O(h^2)$ as $h \to 0$. In particular, the composite trapezoidal rule converges as $h \to 0$ (or, equivalently, $n \to \infty$) in (3.24), provided $f \in C^2[a,b]$.

It should be noted that (3.25) holds only for real-valued functions f and cannot 
be applied to complex-valued functions; cf. (3.29). 

One expects an improvement if instead of linear interpolation one uses quadratic interpolation over two consecutive subintervals. This gives rise to the composite Simpson’s formula.¹ Its “elementary” version, analogous to (3.23), is

$$
\int_{x_k}^{x_{k+2}} f(x)\,dx = \frac{h}{3}\,(f_k + 4 f_{k+1} + f_{k+2}) - \frac{1}{90}\, h^5 f^{(4)}(\xi_k), \qquad x_k < \xi_k < x_{k+2}, \tag{3.26}
$$

where it has been assumed that $f \in C^4[a,b]$. The remainder term shown in (3.26) does not come about as easily as before in (3.23), since the Mean Value Theorem is no longer applicable, the factor $(x - x_k)(x - x_{k+1})(x - x_{k+2})$ changing sign at the midpoint of $[x_k, x_{k+2}]$. However, an alternative derivation of (3.26), using Hermite interpolation (cf. Ex. 9), not only produces the desired error term, but also explains its unexpectedly large order $O(h^5)$. If $n$ is even, we can sum up all $n/2$ contributions in (3.26) and obtain the composite Simpson’s rule,

$$
\int_a^b f(x)\,dx = \frac{h}{3}\,(f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + \cdots + 4f_{n-1} + f_n) + E_n^S(f),
$$

$$
E_n^S(f) = -\frac{1}{180}\,(b - a)\, h^4 f^{(4)}(\xi), \qquad a < \xi < b. \tag{3.27}
$$

¹Thomas Simpson (1710–1761) was an English mathematician, self-educated, and author of many textbooks popular at the time. Simpson published his formula in 1743, but it was already known to Cavalieri [1639], Gregory [1668], and Cotes [1722], among others.


The error term in (3.27) is the result of a simplification similar to the one previously 
carried out for the trapezoidal rule (cf. Ex. 9(c)). Comparing it with the one in (3.25), 
we see that we gained two orders of accuracy without any appreciable increase in 
work (same number of function evaluations). This is the reason why Simpson’s 
rule has long been, and continues to be, one of the most popular general-purpose 
integration methods. 
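The orders $O(h^2)$ and $O(h^4)$ in (3.25) and (3.27) are easy to check experimentally. The sketch below is a plain transcription of (3.24) and (3.27); the function names and the test integral $\int_0^1 e^x\,dx = e - 1$ are assumptions chosen for this illustration.

```python
import math

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule (3.24) with n subintervals."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        s += f(a + k * h)
    return h * s

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule (3.27); n must be even."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    h = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        # interior weights alternate 4, 2, 4, ..., 4
        s += (4.0 if k % 2 == 1 else 2.0) * f(a + k * h)
    return h * s / 3.0

exact = math.e - 1.0
for n in (8, 16, 32):
    e_t = exact - composite_trapezoid(math.exp, 0.0, 1.0, n)
    e_s = exact - composite_simpson(math.exp, 0.0, 1.0, n)
    print(f"n = {n:2d}   E_T = {e_t:+.3e}   E_S = {e_s:+.3e}")
```

Each halving of $h$ should reduce the trapezoidal error by a factor close to 4 and the Simpson error by a factor close to 16, with both errors negative here because $f''$ and $f^{(4)}$ are positive.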

The composite trapezoidal rule, nevertheless, has its own advantages. Although it integrates exactly polynomials of degree 1 only, it does much better with trigonometric polynomials. Suppose, indeed (for simplicity), that the interval $[a,b]$ is $[0,2\pi]$, and denote by $\mathbb{T}_m[0,2\pi]$ the class of trigonometric polynomials of degree $m$,

$$
\mathbb{T}_m[0,2\pi] = \{\, t(x) : t(x) = a_0 + a_1 \cos x + a_2 \cos 2x + \cdots + a_m \cos mx + b_1 \sin x + b_2 \sin 2x + \cdots + b_m \sin mx \,\}.
$$

Then

$$
E_n^T(f) = 0 \quad \text{for all } f \in \mathbb{T}_{n-1}[0,2\pi]. \tag{3.28}
$$


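This is most easily verified by taking for $f$ the complex exponential $e_\nu(x) = e^{i\nu x}$ ($= \cos \nu x + i \sin \nu x$), $\nu = 0, 1, 2, \ldots$:

$$
E_n^T(e_\nu) = \int_0^{2\pi} e_\nu(x)\,dx - \frac{2\pi}{n} \left[ \frac{1}{2}\, e_\nu(0) + \sum_{k=1}^{n-1} e_\nu(k \cdot 2\pi/n) + \frac{1}{2}\, e_\nu(2\pi) \right] = \int_0^{2\pi} e^{i\nu x}\,dx - \frac{2\pi}{n} \sum_{k=0}^{n-1} e^{i\nu k \cdot 2\pi/n}.
$$

When $\nu = 0$, this is clearly zero, and otherwise, since $\int_0^{2\pi} e^{i\nu x}\,dx = (i\nu)^{-1} e^{i\nu x} \big|_0^{2\pi} = 0$,

$$
E_n^T(e_\nu) =
\begin{cases}
-2\pi & \text{if } \nu \equiv 0 \pmod{n},\ \nu > 0, \\[6pt]
-\dfrac{2\pi}{n}\, \dfrac{1 - e^{i\nu n \cdot 2\pi/n}}{1 - e^{i\nu \cdot 2\pi/n}} = 0 & \text{if } \nu \not\equiv 0 \pmod{n}.
\end{cases} \tag{3.29}
$$

In particular, $E_n^T(e_\nu) = 0$ for $\nu = 0, 1, \ldots, n-1$, which proves (3.28). Taking real and imaginary parts in (3.29) gives

$$
E_n^T(\cos \nu\,\cdot) =
\begin{cases}
-2\pi, & \nu \equiv 0 \pmod{n},\ \nu \ne 0, \\
0 & \text{otherwise},
\end{cases}
\qquad E_n^T(\sin \nu\,\cdot) = 0.
$$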


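Therefore, if $f$ is $2\pi$-periodic and has a uniformly convergent Fourier expansion

$$
f(x) = \sum_{\nu=0}^{\infty} \left[ a_\nu(f) \cos \nu x + b_\nu(f) \sin \nu x \right], \tag{3.30}
$$

where $a_\nu(f)$, $b_\nu(f)$ are the “Fourier coefficients” of $f$, then

$$
E_n^T(f) = \sum_{\nu=0}^{\infty} \left[ a_\nu(f)\, E_n^T(\cos \nu\,\cdot) + b_\nu(f)\, E_n^T(\sin \nu\,\cdot) \right] = -2\pi \sum_{\ell=1}^{\infty} a_{\ell n}(f). \tag{3.31}
$$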


From the theory of Fourier series, it is known that the Fourier coefficients of $f$ go to zero faster the smoother $f$ is. More precisely, if $f \in C^r[\mathbb{R}]$, then $a_\nu(f) = O(\nu^{-r})$ as $\nu \to \infty$ (and similarly for $b_\nu(f)$). Since by (3.31), $E_n^T(f) \sim -2\pi a_n(f)$, it follows that

$$
E_n^T(f) = O(n^{-r}) \quad \text{as } n \to \infty \qquad (f \in C^r[\mathbb{R}],\ 2\pi\text{-periodic}), \tag{3.32}
$$

which, if $r > 2$, is better than $E_n^T(f) = O(n^{-2})$, valid for nonperiodic functions $f$. In particular, if $r = \infty$, then the trapezoidal rule converges faster than any power of $n^{-1}$. It should be noted, however, that $f$ must be smooth on the whole real line $\mathbb{R}$. (See (3.19) for an example.) Starting with a function $f \in C^r[0, 2\pi]$ and extending it periodically to $\mathbb{R}$ will not in general produce a function $f \in C^r[\mathbb{R}]$.
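Both the aliasing pattern of (3.29) and the rapid convergence (3.32) for smooth periodic integrands can be checked with a few lines of Python. In this sketch the function names, the choice $n = 8$, and the test integrand $e^{\cos x}$ are assumptions for illustration; the reference value uses the standard identity $\int_0^{2\pi} e^{\cos x}\,dx = 2\pi I_0(1)$, with the modified Bessel function $I_0$ summed from its power series.

```python
import math

def trapezoid_periodic(f, n):
    """Composite trapezoidal rule (3.24) on [0, 2*pi] for a 2*pi-periodic f.

    Since f(0) = f(2*pi), the two half-weighted end values merge into one term.
    """
    h = 2.0 * math.pi / n
    return h * sum(f(k * h) for k in range(n))

def cosine_error(n, nu):
    """E_n^T(cos(nu x)): exact integral minus trapezoidal approximation."""
    exact = 2.0 * math.pi if nu == 0 else 0.0
    return exact - trapezoid_periodic(lambda x: math.cos(nu * x), n)

def bessel_i0(x, terms=30):
    """Modified Bessel function I_0(x), summed from its power series."""
    return sum((x * x / 4.0) ** k / math.factorial(k) ** 2 for k in range(terms))

for nu in (0, 3, 7, 8, 9, 16):
    print(f"n = 8, nu = {nu:2d}   E = {cosine_error(8, nu):+.3e}")

smooth_exact = 2.0 * math.pi * bessel_i0(1.0)
for m in (4, 8, 16):
    approx = trapezoid_periodic(lambda x: math.exp(math.cos(x)), m)
    print(f"n = {m:2d}   error = {smooth_exact - approx:+.3e}")
```

The first loop should show an error of $-2\pi$ only for $\nu = 8$ and $\nu = 16$; the second should show the error dropping to rounding level by $n = 16$.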

Another instance in which the composite trapezoidal rule excels is for functions $f$ defined on $\mathbb{R}$ and having the following properties for some $r \ge 1$,

$$
f \in C^{2r+1}[\mathbb{R}], \qquad \int_{\mathbb{R}} |f^{(2r+1)}(x)|\,dx < \infty,
$$

$$
\lim_{x \to -\infty} f^{(2\rho-1)}(x) = \lim_{x \to \infty} f^{(2\rho-1)}(x) = 0, \qquad \rho = 1, 2, \ldots, r. \tag{3.33}
$$

In this case, it can be shown that

$$
\int_{\mathbb{R}} f(x)\,dx = h \sum_{k=-\infty}^{\infty} f(kh) + E(f;h) \tag{3.34}
$$

has an error $E(f;h)$ satisfying $E(f;h) = O(h^{2r+1})$, $h \to 0$. Therefore, again, if (3.33) holds for all $r = 1, 2, 3, \ldots$, then the error goes to zero faster than any power of $h$.
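A quick experiment with (3.34) illustrates this behavior for the Gaussian $f(x) = e^{-x^2}$, which satisfies (3.33) for every $r$ and has $\int_{\mathbb{R}} f = \sqrt{\pi}$. The sketch below truncates the infinite sum at $|kh| \le$ `cutoff`; this truncation, the value `cutoff = 12`, and the function names are assumptions of the example (the neglected terms are of size $e^{-144}$ and smaller).

```python
import math

def trapezoid_real_line(f, h, cutoff=12.0):
    """h * sum of f(k h) over |k h| <= cutoff: a truncated form of (3.34)."""
    kmax = int(math.ceil(cutoff / h))
    return h * sum(f(k * h) for k in range(-kmax, kmax + 1))

def gaussian(x):
    return math.exp(-x * x)

for h in (2.0, 1.0, 0.5):
    err = math.sqrt(math.pi) - trapezoid_real_line(gaussian, h)
    print(f"h = {h:3.1f}   error = {err:+.3e}")
```

Halving $h$ from 1 to 0.5 should take the error from roughly $10^{-4}$ down to rounding level, far faster than any fixed power of $h$ would suggest.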


3.2.2 (Weighted) Newton—Cotes and Gauss Formulae 


A weighted quadrature formula is a formula of the type

$$
\int_a^b f(t)\, w(t)\,dt = \sum_{k=1}^{n} w_k f(t_k) + E_n(f), \tag{3.35}
$$

where $w$ is a positive (or at least nonnegative) “weight function,” assumed integrable over $(a,b)$. The interval $(a,b)$ may now be finite or infinite. If it is infinite, we must make sure that the integral in (3.35) is well defined, at least when $f$ is a polynomial. We achieve this by requiring that all moments of the weight function,

$$
\mu_s = \int_a^b t^s w(t)\,dt, \qquad s = 0, 1, 2, \ldots, \tag{3.36}
$$

exist and be finite.
We say that the quadrature formula (3.35) has (polynomial) degree of exactness $d$ if

$$
E_n(f) = 0 \quad \text{for all } f \in \mathbb{P}_d; \tag{3.37}
$$

that is, the formula has zero error whenever $f$ is a polynomial of degree $\le d$. We call (3.35) interpolatory, if it has degree of exactness $d = n - 1$. Interpolatory formulae are precisely those “obtained by interpolation,” that is, for which

$$
\sum_{k=1}^{n} w_k f(t_k) = \int_a^b p_{n-1}(f; t_1, \ldots, t_n; t)\, w(t)\,dt, \tag{3.38}
$$

or, equivalently,

$$
w_k = \int_a^b \ell_k(t)\, w(t)\,dt, \qquad k = 1, 2, \ldots, n, \tag{3.39}
$$

where

$$
\ell_k(t) = \prod_{\substack{l=1 \\ l \ne k}}^{n} \frac{t - t_l}{t_k - t_l} \tag{3.40}
$$

are the elementary Lagrange interpolation polynomials associated with the nodes $t_1, t_2, \ldots, t_n$. The fact that (3.35) with $w_k$ given by (3.39) has degree of exactness $d = n - 1$ is evident, since for any $f \in \mathbb{P}_{n-1}$ we have $p_{n-1}(f;\,\cdot\,) = f(\cdot)$ in (3.38). Conversely, if (3.35) has degree of exactness $d = n - 1$, then putting $f(t) = \ell_r(t)$ in (3.35) gives $\int_a^b \ell_r(t) w(t)\,dt = \sum_{k=1}^{n} w_k \ell_r(t_k) = w_r$, $r = 1, 2, \ldots, n$, that is, (3.39).

We see, therefore, that given any $n$ distinct nodes $t_1, t_2, \ldots, t_n$, it is always possible to construct a formula of type (3.35), which is exact for all polynomials of degree $\le n - 1$. In the case $w(t) \equiv 1$ on $[-1,1]$, and $t_k$ equally spaced on $[-1,1]$, the feasibility of such a construction was already alluded to by Newton in 1687 and implemented in detail by Cotes² around 1712. By extension, we call the formula (3.35), with the $t_k$ prescribed and the $w_k$ given by (3.39), a Newton–Cotes formula.
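Since a formula with $n$ distinct nodes is interpolatory exactly when it integrates $1, t, \ldots, t^{n-1}$ without error, the weights (3.39) can also be obtained from the linear system $\sum_k w_k t_k^s = \mu_s$, $s = 0, \ldots, n-1$, with the moments $\mu_s$ of (3.36). The sketch below solves this system in exact rational arithmetic; the function names and the small Gauss–Jordan solver are assumptions of this example, and the approach is meant only for small $n$ (the system becomes ill-conditioned in floating point as $n$ grows).

```python
from fractions import Fraction

def interpolatory_weights(nodes, moments):
    """Solve sum_k w_k * t_k**s = mu_s, s = 0..n-1, exactly (Gauss-Jordan)."""
    n = len(nodes)
    # Augmented matrix [V | mu], with V[s][k] = t_k**s.
    a = [[Fraction(t) ** s for t in nodes] + [Fraction(moments[s])]
         for s in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [x / a[col][col] for x in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[k][n] for k in range(n)]

def unit_weight_moments(n):
    """Moments mu_s of w(t) = 1 on [-1, 1] for s = 0..n-1."""
    return [Fraction(2, s + 1) if s % 2 == 0 else Fraction(0) for s in range(n)]

def equispaced_nodes(n):
    """n >= 2 equally spaced nodes on [-1, 1], endpoints included."""
    return [Fraction(-1) + Fraction(2 * k, n - 1) for k in range(n)]

for n in (2, 3, 5):
    w = interpolatory_weights(equispaced_nodes(n), unit_weight_moments(n))
    print(n, [str(x) for x in w])
```

For $n = 3$ this reproduces the weights $\frac13, \frac43, \frac13$ of the elementary Simpson rule on $[-1,1]$.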

The question naturally arises whether we can do better, that is, whether we can achieve $d > n - 1$ by a judicious choice of the nodes $t_k$ (the weights $w_k$ being necessarily given by (3.39)). The answer is surprisingly simple and direct. To formulate it, we introduce the node polynomial

$$
\omega_n(t) = \prod_{k=1}^{n} (t - t_k). \tag{3.41}
$$


Theorem 3.2.1. Given an integer $k$ with $0 \le k \le n$, the quadrature formula (3.35) has degree of exactness $d = n - 1 + k$ if and only if both of the following conditions are satisfied.

(a) The formula (3.35) is interpolatory.
(b) The node polynomial $\omega_n$ in (3.41) satisfies $\int_a^b \omega_n(t)\, p(t)\, w(t)\,dt = 0$ for all $p \in \mathbb{P}_{k-1}$.

The condition in (b) imposes $k$ conditions on the nodes $t_1, t_2, \ldots, t_n$ of (3.35). (If $k = 0$, there is no restriction since, as we know, we can always get $d = n - 1$.) In effect, $\omega_n$ must be orthogonal to $\mathbb{P}_{k-1}$ relative to the weight function $w$. Since $w(t) \ge 0$, we have necessarily $k \le n$; otherwise, $\omega_n$ would have to be orthogonal to $\mathbb{P}_n$, in particular, orthogonal to itself, which is impossible. Thus, $k = n$ is optimal, giving rise to a quadrature rule of maximum degree of exactness $d_{\max} = 2n - 1$. Condition (b) then amounts to orthogonality of $\omega_n$ to all polynomials of lower degree; that is, $\omega_n(\cdot) = \pi_n(\cdot\,; w)$ is precisely the $n$th-degree orthogonal polynomial belonging to the weight function $w$ (cf. Chap. 2, Sect. 2.1.4(2)). This optimal formula is called the Gaussian quadrature formula associated with the weight function $w$. Its nodes, therefore, are the zeros of $\pi_n(\cdot\,; w)$, and the weights $w_k$ are given as in (3.39); thus,

$$
\pi_n(t_k; w) = 0, \qquad w_k = \int_a^b \frac{\pi_n(t; w)}{(t - t_k)\, \pi_n'(t_k; w)}\, w(t)\,dt, \qquad k = 1, 2, \ldots, n. \tag{3.42}
$$


²Roger Cotes (1682–1716), precocious son of an English country pastor, was entrusted with the preparation of the second edition of Newton’s Principia. He worked out in detail Newton’s idea of numerical integration and published the coefficients (now known as Cotes numbers) of the $n$-point formula for all $n < 11$. Upon his death at the early age of 33, Newton said of him: “If he had lived, we might have known something.”

The formula was developed in 1814 by Gauss³ for the special case $w(t) \equiv 1$ on $[-1,1]$, and extended to more general weight functions by Christoffel⁴ in 1877. It is, therefore, also referred to as the Gauss–Christoffel quadrature formula.


Proof of Theorem 3.2.1. We first prove the necessity of (a) and (b). Since, by assumption, the degree of exactness is $d = n - 1 + k \ge n - 1$, condition (a) is trivial. Condition (b) also follows immediately, since, for any $p \in \mathbb{P}_{k-1}$, the product $\omega_n p$ is in $\mathbb{P}_{n-1+k}$; hence,

$$
\int_a^b \omega_n(t)\, p(t)\, w(t)\,dt = \sum_{k=1}^{n} w_k\, \omega_n(t_k)\, p(t_k),
$$

which vanishes, since $\omega_n(t_k) = 0$ for $k = 1, 2, \ldots, n$.

To prove the sufficiency of (a), (b), we must show that for any $p \in \mathbb{P}_{n-1+k}$ we have $E_n(p) = 0$ in (3.35). Given any such $p$, divide it by $\omega_n$, so that

$$
p = q\, \omega_n + r, \qquad q \in \mathbb{P}_{k-1}, \quad r \in \mathbb{P}_{n-1},
$$

where $q$ is the quotient and $r$ the remainder. There follows

$$
\int_a^b p(t)\, w(t)\,dt = \int_a^b q(t)\, \omega_n(t)\, w(t)\,dt + \int_a^b r(t)\, w(t)\,dt.
$$

The first integral on the right, by (b), is zero, since $q \in \mathbb{P}_{k-1}$, whereas the second, by (a), since $r \in \mathbb{P}_{n-1}$, equals

$$
\sum_{k=1}^{n} w_k\, r(t_k) = \sum_{k=1}^{n} w_k \left[ p(t_k) - q(t_k)\, \omega_n(t_k) \right] = \sum_{k=1}^{n} w_k\, p(t_k),
$$

the last equality following again from $\omega_n(t_k) = 0$, $k = 1, 2, \ldots, n$. This completes the proof. □


³Carl Friedrich Gauss (1777–1855) was one of the greatest mathematicians of the 19th century, and perhaps of all time. He spent almost his entire life in Göttingen, where he was director of the observatory for some 40 years. Already as a student in Göttingen, Gauss discovered that the regular
17-gon can be constructed by compass and ruler, thereby settling a problem that had been open 
since antiquity. His dissertation gave the first proof of the Fundamental Theorem of Algebra (that an 
algebraic equation of degree n has exactly n roots). He went on to make fundamental contributions 
to number theory, differential and non-Euclidean geometry, elliptic and hypergeometric functions, 
celestial mechanics, geodesy, and various branches of physics, notably magnetism and optics. His 
computational efforts in celestial mechanics and geodesy, based on the principle of least squares, 
required the solution (by hand) of large systems of linear equations, for which he used what today 
are known as Gauss elimination and relaxation methods. Gauss’s work on quadrature builds upon 
the earlier work of Newton and Cotes. 


⁴Elwin Bruno Christoffel (1829–1900) was active for short periods of time in Berlin and Zurich and, for the rest of his life, in Strasbourg. He is best known for his work in geometry, in particular, tensor analysis, which became important in Einstein’s theory of relativity.

The case $k = n$ of Theorem 3.2.1 (i.e., the Gauss quadrature rule) is discussed further in Sect. 3.2.3. Here, we still mention two special cases with $k < n$, which are of some practical interest. The first is the Gauss–Radau⁵ quadrature formula in which one endpoint, say, $a$, is finite and serves as a quadrature node, say, $t_1 = a$. The maximum degree of exactness attainable then is $d = 2n - 2$ and corresponds to $k = n - 1$ in Theorem 3.2.1. Part (b) of that theorem tells us that the remaining nodes $t_2, \ldots, t_n$ must be the zeros of $\pi_{n-1}(\cdot\,; w_a)$, where $w_a(t) = (t - a)\, w(t)$. Similarly, in the Gauss–Lobatto⁶ formula, both endpoints are finite and serve as nodes, say, $t_1 = a$, $t_n = b$, and the remaining nodes $t_2, \ldots, t_{n-1}$ are taken to be the zeros of $\pi_{n-2}(\cdot\,; w_{a,b})$, $w_{a,b}(t) = (t - a)(b - t)\, w(t)$, thus achieving maximum degree of exactness $d = 2n - 3$.

Example: Two-point Newton—Cotes vs. two-point Gauss 

We compare the Newton–Cotes with the Gauss formula in the case $n = 2$ and for the weight function $w(t) = t^{-1/2}$ on $[0,1]$. The two prescribed nodes in the Newton–Cotes formula are taken to be the endpoints; thus,

$$
\int_0^1 t^{-1/2} f(t)\,dt \approx
\begin{cases}
w_1^{NC} f(0) + w_2^{NC} f(1) & \text{(Newton–Cotes)}, \\[4pt]
w_1^{G} f(t_1) + w_2^{G} f(t_2) & \text{(Gauss)}.
\end{cases}
$$

To get the coefficients in the Newton–Cotes formula, we use (3.39), where

$$
\ell_1(t) = \frac{t - 1}{0 - 1} = 1 - t, \qquad \ell_2(t) = \frac{t - 0}{1 - 0} = t.
$$

This gives

$$
w_1^{NC} = \int_0^1 t^{-1/2}\, \ell_1(t)\,dt = \int_0^1 \left( t^{-1/2} - t^{1/2} \right) dt = \left( 2t^{1/2} - \tfrac{2}{3}\, t^{3/2} \right) \Big|_0^1 = \frac{4}{3},
$$

$$
w_2^{NC} = \int_0^1 t^{-1/2}\, \ell_2(t)\,dt = \int_0^1 t^{1/2}\,dt = \tfrac{2}{3}\, t^{3/2} \Big|_0^1 = \frac{2}{3}.
$$


⁵Jean-Charles-Rodolphe Radau (1835–1911) was born in Germany but spent most of his life in
France. He was strongly attracted to classical music (the French composer Jacques Offenbach was
one of his acquaintances) as he was to celestial mechanics. A gifted writer, he composed many 
popular articles on topics of scientific interest. He was a person working quietly by himself and 
staying away from the spotlight. 


⁶Rehuel Lobatto (1797–1866), a Dutch mathematician of Portuguese ancestry, although very gifted in his youth, stopped short of attaining an academic degree at the University of Amsterdam. He had to wait, and be satisfied with a low-level government position, until 1842 when he was appointed a professor of mathematics at the Technical University of Delft. His mathematical work is relatively unknown, but he has written several textbooks, one of which, on calculus, published in 1851, contains the quadrature rule now named after him.

Thus,

$$
\int_0^1 t^{-1/2} f(t)\,dt = \frac{2}{3}\, \bigl( 2 f(0) + f(1) \bigr) + E_2^{NC}(f). \tag{3.43}
$$

Note how the square root singularity at the origin causes the value $f(0)$ to receive a weight twice as large as that of $f(1)$.

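To develop the Gauss formula, we first construct the required orthogonal polynomial

$$
\pi_2(t) = t^2 - p_1 t + p_2.
$$

Since it is orthogonal to the constant 1, and to $t$, we get

$$
0 = \int_0^1 t^{-1/2}\, \pi_2(t)\,dt = \int_0^1 \left( t^{3/2} - p_1 t^{1/2} + p_2 t^{-1/2} \right) dt = \left( \tfrac{2}{5}\, t^{5/2} - \tfrac{2}{3}\, p_1 t^{3/2} + 2 p_2 t^{1/2} \right) \Big|_0^1 = \frac{2}{5} - \frac{2}{3}\, p_1 + 2 p_2,
$$

$$
0 = \int_0^1 t^{-1/2} \cdot t\, \pi_2(t)\,dt = \int_0^1 \left( t^{5/2} - p_1 t^{3/2} + p_2 t^{1/2} \right) dt = \left( \tfrac{2}{7}\, t^{7/2} - \tfrac{2}{5}\, p_1 t^{5/2} + \tfrac{2}{3}\, p_2 t^{3/2} \right) \Big|_0^1 = \frac{2}{7} - \frac{2}{5}\, p_1 + \frac{2}{3}\, p_2,
$$

that is, the linear system

$$
\tfrac{1}{3}\, p_1 - p_2 = \tfrac{1}{5}, \qquad \tfrac{1}{5}\, p_1 - \tfrac{1}{3}\, p_2 = \tfrac{1}{7}.
$$

The solution is $p_1 = \frac{6}{7}$, $p_2 = \frac{3}{35}$; thus,

$$
\pi_2(t) = t^2 - \frac{6}{7}\, t + \frac{3}{35}.
$$

The Gauss nodes (the zeros of $\pi_2$) are therefore, to ten decimal places,

$$
t_1 = \frac{1}{7} \left( 3 - 2\sqrt{\tfrac{6}{5}} \right) = 0.1155871100, \qquad t_2 = \frac{1}{7} \left( 3 + 2\sqrt{\tfrac{6}{5}} \right) = 0.7415557471.
$$

For the weights $w_1^G$, $w_2^G$, we could use again (3.39), but it is simpler to set up a linear system of equations, which expresses the fact that the formula is exact for $f(t) \equiv 1$ and $f(t) \equiv t$:

$$
w_1^G + w_2^G = \int_0^1 t^{-1/2}\,dt = 2, \qquad t_1 w_1^G + t_2 w_2^G = \int_0^1 t^{-1/2} \cdot t\,dt = \frac{2}{3}.
$$

This yields

$$
w_1^G = \frac{-2 t_2 + \frac{2}{3}}{t_1 - t_2}, \qquad w_2^G = \frac{2 t_1 - \frac{2}{3}}{t_1 - t_2},
$$

or, with the values of $t_1$, $t_2$ substituted from the preceding,

$$
w_1^G = 1 + \frac{1}{3} \sqrt{\tfrac{5}{6}} = 1.3042903097, \qquad w_2^G = 1 - \frac{1}{3} \sqrt{\tfrac{5}{6}} = 0.6957096903.
$$

Again, $w_1^G$ is larger than $w_2^G$, this time by a factor 1.87476....

Summarizing, we obtain the Gauss formula

$$
\int_0^1 t^{-1/2} f(t)\,dt = \left( 1 + \frac{1}{3} \sqrt{\tfrac{5}{6}} \right) f\!\left( \frac{1}{7} \left( 3 - 2\sqrt{\tfrac{6}{5}} \right) \right) + \left( 1 - \frac{1}{3} \sqrt{\tfrac{5}{6}} \right) f\!\left( \frac{1}{7} \left( 3 + 2\sqrt{\tfrac{6}{5}} \right) \right) + E_2^G(f). \tag{3.44}
$$

To illustrate, consider $f(t) = \cos\left(\frac{1}{2}\pi t\right)$; that is,

$$
I = \int_0^1 t^{-1/2} \cos\left( \tfrac{1}{2}\pi t \right) dt = 2C(1) = 1.5597868008\ldots.
$$

($C(x)$ is the Fresnel integral $\int_0^x \cos\left(\frac{1}{2}\pi t^2\right) dt$.) Newton–Cotes and Gauss give the following approximations,

$$
I^{NC} = \frac{4}{3} = 1.3333\ldots, \qquad I^{G} = 1.2828510665 + 0.2747384931 = 1.5575895596.
$$

The respective errors are

$$
E_2^{NC} = 0.226453\ldots, \qquad E_2^{G} = 0.002197\ldots,
$$

demonstrating the superiority of the Gauss formula even for $n = 2$.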
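The numbers in this example can be reproduced with a few lines of Python. The sketch follows the same steps as the text (orthogonality conditions expressed through the moments $\mu_s = \int_0^1 t^{s-1/2}\,dt = 1/(s + \frac12)$, then the zeros of $\pi_2$, then the two exactness conditions for the weights); the variable names and the use of Cramer's rule for the $2 \times 2$ system are choices made for this illustration.

```python
import math

# Moments mu_s = int_0^1 t**(s - 1/2) dt = 1 / (s + 1/2), s = 0..3.
mu = [1.0 / (s + 0.5) for s in range(4)]

# Orthogonality of pi_2(t) = t^2 - p1 t + p2 to 1 and to t:
#   mu1 p1 - mu0 p2 = mu2,    mu2 p1 - mu1 p2 = mu3   (solved by Cramer's rule)
det = mu[0] * mu[2] - mu[1] ** 2
p1 = (mu[0] * mu[3] - mu[1] * mu[2]) / det
p2 = (mu[1] * mu[3] - mu[2] ** 2) / det

# Gauss nodes: the zeros of pi_2.
disc = math.sqrt(p1 * p1 - 4.0 * p2)
t1, t2 = (p1 - disc) / 2.0, (p1 + disc) / 2.0

# Gauss weights from exactness for f = 1 and f = t.
w1 = (mu[1] - mu[0] * t2) / (t1 - t2)
w2 = (mu[0] * t1 - mu[1]) / (t1 - t2)

def f(t):
    return math.cos(0.5 * math.pi * t)

gauss = w1 * f(t1) + w2 * f(t2)
newton_cotes = (2.0 / 3.0) * (2.0 * f(0.0) + f(1.0))   # formula (3.43)
reference = 1.5597868008                               # value 2 C(1) quoted in the text

print(f"p1 = {p1:.10f}, p2 = {p2:.10f}")
print(f"t1 = {t1:.10f}, t2 = {t2:.10f}")
print(f"w1 = {w1:.10f}, w2 = {w2:.10f}")
print(f"Gauss = {gauss:.10f}, error = {reference - gauss:.6f}")
print(f"Newton-Cotes = {newton_cotes:.10f}, error = {reference - newton_cotes:.6f}")
```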
3.2.3 Properties of Gaussian Quadrature Rules


The Gaussian quadrature rule (3.35) and (3.42), in addition to being optimal, has 
some interesting and useful properties. The more important ones are now listed, 
with most of the proofs relegated to the exercises. 


(a) All nodes $t_k$ are real, distinct, and contained in the open interval $(a,b)$. This is a well-known property satisfied by the zeros of orthogonal polynomials (cf. Ex. 32).

(b) All weights $w_k$ are positive. The formula (3.39) for the weights gives no clue as to their signs; however, an ingenious observation of Stieltjes⁷ proves it almost immediately. Indeed,

$$
0 < \int_a^b \ell_j^2(t)\, w(t)\,dt = \sum_{k=1}^{n} w_k\, \ell_j^2(t_k) = w_j, \qquad j = 1, 2, \ldots, n,
$$

the first equality following since $\ell_j^2 \in \mathbb{P}_{2n-2}$ and the degree of exactness is $d = 2n - 1$.

(c) If $[a,b]$ is a finite interval, then the Gauss formula converges for any continuous function; that is, $E_n(f) \to 0$ as $n \to \infty$ whenever $f \in C[a,b]$. This is basically a consequence of the Weierstrass Approximation Theorem, which implies that, if $\hat{p}_{2n-1}(f;\,\cdot\,)$ denotes the polynomial of degree $2n - 1$ that approximates $f$ best on $[a,b]$ in the uniform norm, then

$$
\lim_{n \to \infty} \| f(\cdot) - \hat{p}_{2n-1}(f;\,\cdot\,) \|_\infty = 0.
$$

Since $E_n(\hat{p}_{2n-1}) = 0$ (polynomial degree of exactness $d = 2n - 1$), it follows that

$$
|E_n(f)| = |E_n(f - \hat{p}_{2n-1})| = \left| \int_a^b \left[ f(t) - \hat{p}_{2n-1}(f;t) \right] w(t)\,dt - \sum_{k=1}^{n} w_k \left[ f(t_k) - \hat{p}_{2n-1}(f;t_k) \right] \right|
$$

⁷Thomas Jan Stieltjes (1856–1894), born in the Netherlands, studied at the Technical Institute
of Delft, but never completed his degree because of a deep-seated aversion to examinations.
He nevertheless got a job at the Observatory of Leiden as a “computer assistant for astronomical 
calculations.” His early publications caught the attention of Hermite, who was able to eventually 
secure a university position for Stieltjes in Toulouse. A life-long friendship evolved between these 
two great men, of which two volumes of their correspondence (see Hermite and Stieltjes [1905]) 
gives vivid testimony (and still makes fascinating reading). Stieltjes is best known for his work on 
continued fractions and the moment problem, which, among other things, led him to invent a new 
concept of integral, which now bears his name. He died of tuberculosis at the young age of 38. 
$$
\le \int_a^b \left| f(t) - \hat{p}_{2n-1}(f;t) \right| w(t)\,dt + \sum_{k=1}^{n} w_k \left| f(t_k) - \hat{p}_{2n-1}(f;t_k) \right| \le \| f(\cdot) - \hat{p}_{2n-1}(f;\,\cdot\,) \|_\infty \left[ \int_a^b w(t)\,dt + \sum_{k=1}^{n} w_k \right].
$$

Here, the positivity of the weights $w_k$ has been used crucially. Noting that

$$
\sum_{k=1}^{n} w_k = \int_a^b w(t)\,dt = \mu_0 \quad \text{(cf. (3.36))},
$$

we thus conclude

$$
|E_n(f)| \le 2\mu_0\, \| f - \hat{p}_{2n-1} \|_\infty \to 0 \quad \text{as } n \to \infty.
$$

(d) The nodes in the $n$-point Gauss formula separate those of the $(n+1)$-point formula (cf. Ex. 33).

The next property forms the basis of an efficient algorithm for computing Gaussian quadrature formulae.

(e) Let $\alpha_k = \alpha_k(w)$, $\beta_k = \beta_k(w)$ be the recurrence coefficients for the orthogonal polynomials $\pi_k(\cdot) = \pi_k(\cdot\,; w)$; that is (cf. Chap. 2, Sect. 2.1.4(2)),

$$
\pi_{k+1}(t) = (t - \alpha_k)\, \pi_k(t) - \beta_k\, \pi_{k-1}(t), \quad k = 0, 1, 2, \ldots, \qquad \pi_0(t) = 1, \quad \pi_{-1}(t) = 0, \tag{3.45}
$$

with $\beta_0$ (as is customary) defined by $\beta_0 = \int_a^b w(t)\,dt$ $(= \mu_0)$. The $n$th-order Jacobi matrix for the weight function $w$ is a tridiagonal symmetric matrix defined by

$$
J_n = J_n(w) =
\begin{bmatrix}
\alpha_0 & \sqrt{\beta_1} & & & 0 \\
\sqrt{\beta_1} & \alpha_1 & \sqrt{\beta_2} & & \\
 & \sqrt{\beta_2} & \ddots & \ddots & \\
 & & \ddots & \ddots & \sqrt{\beta_{n-1}} \\
0 & & & \sqrt{\beta_{n-1}} & \alpha_{n-1}
\end{bmatrix}.
$$

Then the nodes $t_k$ are the eigenvalues of $J_n$ (cf. Ex. 44(a)),

$$
J_n v_k = t_k v_k, \qquad v_k^T v_k = 1, \qquad k = 1, 2, \ldots, n, \tag{3.46}
$$

and the weights $w_k$ expressible in terms of the first component $v_{k,1}$ of the corresponding (normalized) eigenvectors $v_k$ by (cf. Ex. 44(b))

$$
w_k = \beta_0\, v_{k,1}^2, \qquad k = 1, 2, \ldots, n. \tag{3.47}
$$


Thus, to compute the Gauss formula, we must solve an eigenvalue/eigenvector 
problem for a symmetric tridiagonal matrix. This is a routine problem in 
numerical linear algebra, and very efficient methods (the QR algorithm, for 
example) are known for solving it. 
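For the Legendre weight $w(t) \equiv 1$ on $[-1,1]$, where $\alpha_k = 0$, $\beta_0 = 2$, and $\beta_k = (4 - k^{-2})^{-1}$ for $k > 0$, properties (3.46) and (3.47) translate into the following sketch. It assumes NumPy is available and uses its general symmetric eigensolver `numpy.linalg.eigh` rather than a specialized tridiagonal QR routine; the function name `gauss_legendre` is an assumption of this example.

```python
import numpy as np

def gauss_legendre(n):
    """n-point Gauss rule for w(t) = 1 on [-1, 1] from the Jacobi matrix J_n."""
    k = np.arange(1, n)
    beta = 1.0 / (4.0 - 1.0 / k**2)            # beta_k for k = 1, ..., n-1
    off = np.sqrt(beta)
    # alpha_k = 0, so the diagonal of J_n is zero.
    jacobi = np.diag(off, 1) + np.diag(off, -1)
    nodes, vectors = np.linalg.eigh(jacobi)    # eigenvalues in ascending order
    weights = 2.0 * vectors[0, :] ** 2         # beta_0 times squared first components
    return nodes, weights

nodes, weights = gauss_legendre(5)
print(nodes)
print(weights)
# Degree of exactness 2n - 1 = 9: the monomial t^8 is integrated exactly.
print(np.dot(weights, nodes**8), 2.0 / 9.0)
```

The same code works for any weight function once its recurrence coefficients $\alpha_k$, $\beta_k$ are known; only the construction of the Jacobi matrix and the factor $\beta_0$ change.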

(f) Markov⁸ observed in 1885 that the Gauss quadrature formula can also be obtained by Hermite interpolation on the nodes $t_k$, each counting as a double node, if one requires that after integration all coefficients of the derivative terms should be zero (cf. Ex. 34). An interesting consequence of this new interpretation is the following expression for the remainder term, which follows directly from the error term of Hermite interpolation (cf. Chap. 2, Sect. 2.2.7) and the Mean Value Theorem of integration,

$$
E_n(f) = \frac{f^{(2n)}(\xi)}{(2n)!} \int_a^b \left[ \pi_n(t; w) \right]^2 w(t)\,dt, \qquad a < \xi < b. \tag{3.48}
$$

Here, $\pi_n(\cdot\,; w)$ is the orthogonal polynomial, with leading coefficient 1, relative to the weight function $w$. It is assumed, of course, that $f \in C^{2n}[a,b]$.


We conclude this section with a table of some classical weight functions, their corresponding orthogonal polynomials, and the recursion coefficients $\alpha_k$, $\beta_k$ for generating the orthogonal polynomials as in (3.45) (see Table 3.1). We also include the standard notations for these polynomials (these usually do not refer to the monic polynomials). Note that the recurrence coefficients $\alpha_k$ are all zero for even weight functions on intervals symmetric with respect to the origin (cf. Chap. 2, Sect. 2.1.4(2)). For Jacobi polynomials, the recursion coefficients are explicitly known, but the formulae are a bit lengthy and are not given here (they can be found, e.g., in Gautschi [2004], Table 1.1).


⁸Andrey Andreyevich Markov (1856–1922), a student of Chebyshev, was active in St. Petersburg. While his early work was in number theory and analysis, he is best known for his work in probability theory, where he studied certain discrete random processes now known as Markov chains.

⁹Carl Gustav Jacob Jacobi (1804–1851) was a contemporary of Gauss and with him one of the most important 19th-century mathematicians in Germany. His name is connected with elliptic functions, partial differential equations of dynamics, calculus of variations, celestial mechanics; functional determinants also bear his name. In his work on celestial mechanics he invented what is now called the Jacobi method for solving linear algebraic systems.

¹⁰Edmond Laguerre (1834–1886) was a French mathematician active in Paris, who made essential contributions to geometry, algebra, and analysis.

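Table 3.1 Classical orthogonal polynomials

\begin{tabular}{llllll}
$w(t)$ & $[a,b]$ & Orth. Pol. & Notation & $\alpha_k$ & $\beta_k$ \\
\hline
$1$ & $[-1,1]$ & Legendre & $P_n$ & $0$ & $2$ ($k = 0$); $(4 - k^{-2})^{-1}$ ($k > 0$) \\
$(1 - t^2)^{-1/2}$ & $[-1,1]$ & Chebyshev \#1 & $T_n$ & $0$ & $\pi$ ($k = 0$); $\frac{1}{2}$ ($k = 1$); $\frac{1}{4}$ ($k > 1$) \\
$(1 - t^2)^{1/2}$ & $[-1,1]$ & Chebyshev \#2 & $U_n$ & $0$ & $\frac{1}{2}\pi$ ($k = 0$); $\frac{1}{4}$ ($k > 0$) \\
$(1 - t)^\alpha (1 + t)^\beta$, $\alpha > -1$, $\beta > -1$ & $[-1,1]$ & Jacobi⁹ & $P_n^{(\alpha,\beta)}$ & known & known \\
$t^\alpha e^{-t}$, $\alpha > -1$ & $[0,\infty]$ & Laguerre¹⁰ & $L_n^{(\alpha)}$ & $2k + \alpha + 1$ & $\Gamma(1 + \alpha)$ ($k = 0$); $k(k + \alpha)$ ($k > 0$) \\
$e^{-t^2}$ & $[-\infty,\infty]$ & Hermite & $H_n$ & $0$ & $\sqrt{\pi}$ ($k = 0$); $\frac{1}{2} k$ ($k > 0$) \\
\end{tabular}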


3.2.4 Some Applications of the Gauss Quadrature Rule 


In many applications, the integrals to be computed have the weight function w 
already built in. In others, one has to figure out for oneself what the most appropriate 
weight function should be. Several examples of this are given in this section. We 
begin, however, with the easy exercise of transforming the Gauss—Jacobi quadrature 
rule from an arbitrary finite interval to the canonical interval [-1,1]. 


(a) The Gauss–Jacobi formula for the interval $[a,b]$. We assume $[a,b]$ a finite interval. What is essential about the weight function in the Jacobi case is the fact that it has an algebraic singularity (with exponent $\alpha$) at the right endpoint, and an algebraic singularity (with exponent $\beta$) at the left endpoint. The integral in question, therefore, is

$$
\int_a^b (b - x)^\alpha (x - a)^\beta g(x)\,dx.
$$

A linear transformation of variables

$$
x = \tfrac{1}{2}(b - a)\,t + \tfrac{1}{2}(b + a), \qquad dx = \tfrac{1}{2}(b - a)\,dt,
$$

maps the $x$-interval $[a,b]$ onto the $t$-interval $[-1,1]$, and the integral becomes

$$
\int_{-1}^{1} \left( \tfrac{1}{2}(b - a)(1 - t) \right)^\alpha \left( \tfrac{1}{2}(b - a)(1 + t) \right)^\beta g\!\left( \tfrac{1}{2}(b - a)\,t + \tfrac{1}{2}(b + a) \right) \tfrac{1}{2}(b - a)\,dt = \left( \tfrac{1}{2}(b - a) \right)^{\alpha+\beta+1} \int_{-1}^{1} (1 - t)^\alpha (1 + t)^\beta f(t)\,dt,
$$

where we have set

$$
f(t) = g\!\left( \tfrac{1}{2}(b - a)\,t + \tfrac{1}{2}(b + a) \right), \qquad -1 \le t \le 1.
$$

The last integral is now in standard form for the application of the Gauss–Jacobi quadrature formula. One thus finds

$$
\int_a^b (b - x)^\alpha (x - a)^\beta g(x)\,dx = \left( \tfrac{1}{2}(b - a) \right)^{\alpha+\beta+1} \sum_{k=1}^{n} w_k^J\, g\!\left( \tfrac{1}{2}(b - a)\,t_k^J + \tfrac{1}{2}(b + a) \right) + E_n(g), \tag{3.49}
$$

where $t_k^J$, $w_k^J$ are the (standard) Gauss–Jacobi nodes and weights. Since with $g$, also $f$ is a polynomial, and both have the same degree, the formula (3.49) is exact for all $g \in \mathbb{P}_{2n-1}$.


(b) Iterated integrals. Let $I$ denote the integral operator

$$
(Ig)(t) = \int_0^t g(\tau)\,d\tau,
$$

and $I^p$ its $p$th power (the identity operator if $p = 0$). Then

$$
(I^{p+1} g)(1) = \int_0^1 (I^p g)(t)\,dt
$$

is the $p$th iterated integral of $g$. It is a well-known fact from calculus that an iterated integral can be written as a simple integral, namely,

$$
(I^{p+1} g)(1) = \frac{1}{p!} \int_0^1 (1 - t)^p g(t)\,dt. \tag{3.50}
$$

Thus, to the integral on the right of (3.50) we could apply any of the standard integration procedures, such as the composite trapezoidal or Simpson’s rule. If $g$ is smooth, this works well for $p$ relatively small. As $p$ gets larger, the factor $(1 - t)^p$ becomes rather unpleasant, since as $p \to \infty$ it approaches the discontinuous function equal to 1 at $t = 0$, and 0 elsewhere on $(0,1]$. This adversely affects the performance of any standard quadrature scheme. However, noting that $(1 - t)^p$ is a Jacobi weight on the interval $[0,1]$, with parameters $\alpha = p$, $\beta = 0$, we can apply (3.49); that is,

$$
(I^{p+1} g)(1) = \frac{1}{2^{p+1}\, p!} \sum_{k=1}^{n} w_k^J\, g\!\left( \tfrac{1}{2}(1 + t_k^J) \right) + E_n(g), \tag{3.51}
$$

and get accurate results with moderate values of $n$, even when $p$ is quite large (cf. MA 5).
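(c) Integration over $\mathbb{R}$. Compute $\int_{-\infty}^{\infty} F(x)\,dx$, assuming that, for some $a > 0$,

$$
F(x) \sim e^{-a x^2}, \qquad x \to \pm\infty. \tag{3.52}
$$

Instead of a weight function we are given here information about the asymptotic behavior of the integrand. It is natural, then, to introduce the new function

$$
f(t) = e^{t^2} F\!\left( \frac{t}{\sqrt{a}} \right), \tag{3.53}
$$

which tends to 1 as $t \to \pm\infty$. The change of variables $x = t/\sqrt{a}$ then gives

$$
\int_{-\infty}^{\infty} F(x)\,dx = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} F\!\left( \frac{t}{\sqrt{a}} \right) dt = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} e^{-t^2} f(t)\,dt.
$$

The last integral is now in a form suitable for the application of the Gauss–Hermite formula. There results

$$
\int_{-\infty}^{\infty} F(x)\,dx = \frac{1}{\sqrt{a}} \sum_{k=1}^{n} w_k^H\, e^{(t_k^H)^2} F\!\left( \frac{t_k^H}{\sqrt{a}} \right) + E_n(f), \tag{3.54}
$$

where $t_k^H$, $w_k^H$ are the Gauss–Hermite nodes and weights. The remainder $E_n(f)$ vanishes whenever $f$ is a polynomial of degree $\le 2n - 1$; that is,

$$
F(x) = e^{-a x^2} p(x), \qquad p \in \mathbb{P}_{2n-1}.
$$

Since the coefficients in (3.54) involve the products $w_k^H \exp\bigl((t_k^H)^2\bigr)$, some tables of Gauss–Hermite formulae also provide these products, in addition to the nodes and weights.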
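Formula (3.54) is straightforward to try out with the Gauss–Hermite nodes and weights supplied by NumPy's `numpy.polynomial.hermite.hermgauss`, which are defined for the weight $e^{-t^2}$. In the sketch below the function name `integrate_gaussian_decay` and the two test integrands are assumptions of this illustration; `F` is expected to accept NumPy arrays.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss

def integrate_gaussian_decay(F, a, n):
    """Approximate the integral of F over R by (3.54), assuming F(x) ~ exp(-a x^2)."""
    t, w = hermgauss(n)                   # Gauss-Hermite nodes/weights, weight exp(-t^2)
    x = t / math.sqrt(a)
    return float(np.sum(w * np.exp(t**2) * F(x)) / math.sqrt(a))

# F(x) = exp(-2 x^2) (1 + x^2): here f(t) = 1 + t^2/2 has degree 2 <= 2n - 1 for n = 2.
val = integrate_gaussian_decay(lambda x: np.exp(-2.0 * x**2) * (1.0 + x**2), 2.0, 2)
print(val, math.sqrt(math.pi / 2.0) * 1.25)

# F(x) = exp(-x^2) cos x, whose integral over R is sqrt(pi) exp(-1/4).
val = integrate_gaussian_decay(lambda x: np.exp(-x**2) * np.cos(x), 1.0, 10)
print(val, math.sqrt(math.pi) * math.exp(-0.25))
```

The first integrand is of the form $e^{-ax^2}p(x)$ and should be integrated exactly up to rounding with only two nodes; the second is not, but $f(t) = \cos t$ is entire, so a moderate $n$ already gives high accuracy.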

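(d) Integration over $\mathbb{R}_+$. Compute $\int_0^\infty F(x)\,dx$, assuming that

$$
F(x) \sim
\begin{cases}
x^p & \text{as } x \downarrow 0 \quad (p > -1), \\
x^{-q} & \text{as } x \to \infty \quad (q > 1).
\end{cases} \tag{3.55}
$$

Similarly as in (c), we now define $f$ by

$$
F(x) = \frac{x^p}{(1 + x)^{p+q}}\, f(x), \qquad x \in \mathbb{R}_+, \tag{3.56}
$$

so as to again have $f(x) \to 1$ as $x \downarrow 0$ and $x \to \infty$. The change of variables

$$
x = \frac{1 + t}{1 - t}, \qquad dx = \frac{2\,dt}{(1 - t)^2}
$$

then yields

$$
\int_0^\infty F(x)\,dx = \int_{-1}^{1} F\!\left( \frac{1 + t}{1 - t} \right) \frac{2\,dt}{(1 - t)^2} = \int_{-1}^{1} \left( \frac{1 + t}{1 - t} \right)^{p} \left( \frac{2}{1 - t} \right)^{-(p+q)} f\!\left( \frac{1 + t}{1 - t} \right) \frac{2\,dt}{(1 - t)^2} = \frac{1}{2^{p+q-1}} \int_{-1}^{1} (1 - t)^{q-2} (1 + t)^{p} g(t)\,dt,
$$

where

$$
g(t) = f\!\left( \frac{1 + t}{1 - t} \right).
$$

This calls for Gauss–Jacobi quadrature with parameters $\alpha = q - 2$, $\beta = p$:

$$
\int_0^\infty F(x)\,dx = \frac{1}{2^{p+q-1}} \left( \sum_{k=1}^{n} w_k^J\, g(t_k^J) + E_n(g) \right).
$$

It remains to re-express $g$ in terms of $f$, and $f$ in terms of $F$, to obtain the final formula

$$
\int_0^\infty F(x)\,dx = 2 \sum_{k=1}^{n} w_k^J\, (1 + t_k^J)^{-p} (1 - t_k^J)^{-q}\, F\!\left( \frac{1 + t_k^J}{1 - t_k^J} \right) + 2^{-p-q+1} E_n(g). \tag{3.57}
$$

This is exact whenever $g(t)$ is a polynomial of degree $\le 2n - 1$, for example a polynomial of that degree in the variable $1 - t$; hence, since $f(x) = g\!\left( \frac{x - 1}{x + 1} \right)$, for any $F(x)$ of the form

$$
F(x) = \frac{x^p}{(1 + x)^{p+q+\lambda}}, \qquad \lambda = 0, 1, 2, \ldots, 2n - 1.
$$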
3.2.5 Approximation of Linear Functionals: Method
of Interpolation vs. Method of Undetermined Coefficients


Up until now, we heavily relied on interpolation to obtain approximations to 
derivatives, integrals, and the like. This is not the only possible approach, however, 
to construct approximation formulae, and indeed not usually the simplest one. 
Another is the method of undetermined coefficients. Both approaches can be 
described in a vastly more general setting, in which a given linear functional is to be 
approximated by other linear functionals so as to have exactness on a suitable finite- 
dimensional function space. Although both approaches yield the same formulae, the 
mechanics involved are quite different. 

Before stating the general approximation problem, we illustrate the two methods 
on some simple examples. 


Example. Obtain an approximation formula of the type

$$\int_0^1 f(x)\,dx \approx a_1 f(0) + a_2 f(1). \tag{3.58}$$

Method of interpolation. Instead of integrating f, we integrate the (linear)
polynomial interpolating f at x = 0 and x = 1. This gives

$$\int_0^1 f(x)\,dx \approx \int_0^1 p_1(f; 0, 1; x)\,dx = \int_0^1 \bigl[(1-x) f(0) + x f(1)\bigr]dx = \tfrac12\,[f(0) + f(1)],$$

the trapezoidal rule, to nobody's surprise.

Method of undetermined coefficients. We simply require (3.58) to be exact for all
linear functions. This is the same as requiring equality in (3.58) when f(x) = 1
and f(x) = x. (Then by linearity, one gets equality also for $f(x) = c_0 + c_1 x$ for
arbitrary constants $c_0$, $c_1$.) This immediately produces the linear system

$$1 = a_1 + a_2, \qquad \tfrac12 = a_2;$$

hence $a_1 = a_2 = \tfrac12$, as before.


Example. Find a formula of the type

$$\int_0^1 \sqrt{x}\, f(x)\,dx \approx a_1 f(0) + a_2 \int_0^1 f(x)\,dx. \tag{3.59}$$

Such a formula may come in handy if we already know the integral on the right and
want to use this information, together with the value of f at x = 0, to approximate
the weighted integral on the left.

Method of interpolation. "Interpolation" here means interpolation to the given
data: the value of f at x = 0 and the integral of f from 0 to 1. We are thus
seeking a polynomial $p \in \mathbb{P}_1$ such that

$$p(0) = f(0), \qquad \int_0^1 p(x)\,dx = \int_0^1 f(x)\,dx, \tag{3.60}$$

which we then substitute in place of f in the left-hand integral of (3.59). If we let
$p(x) = c_0 + c_1 x$, the interpolation conditions (3.60) give

$$c_0 = f(0), \qquad c_0 + \tfrac12 c_1 = \int_0^1 f(x)\,dx;$$

hence

$$c_0 = f(0), \qquad c_1 = 2\left\{\int_0^1 f(x)\,dx - f(0)\right\}.$$

Therefore,

$$\int_0^1 \sqrt{x}\, f(x)\,dx \approx \int_0^1 \sqrt{x}\, p(x)\,dx = \int_0^1 \sqrt{x}\,(c_0 + c_1 x)\,dx = \tfrac23 c_0 + \tfrac25 c_1 = \tfrac23 f(0) + \tfrac25 \cdot 2\left\{\int_0^1 f(x)\,dx - f(0)\right\};$$

that is,

$$\int_0^1 \sqrt{x}\, f(x)\,dx \approx -\tfrac{2}{15}\, f(0) + \tfrac45 \int_0^1 f(x)\,dx. \tag{3.61}$$

Method of undetermined coefficients. Equality in (3.59) for f(x) = 1 and
f(x) = x immediately yields

$$\tfrac23 = a_1 + a_2, \qquad \tfrac25 = \tfrac12\, a_2;$$

hence $a_1 = -\tfrac{2}{15}$, $a_2 = \tfrac45$, the same result as in (3.61), but produced incomparably
faster.

In both examples, we insisted on exactness for polynomials (of degree 1). In 
place of polynomials, we could have chosen other classes of functions, as long as 
we make sure that their dimension matches the number of “free parameters.” 
The essence of the two examples consists of showing how a certain linear
functional can be approximated in terms of other (presumably simpler) linear
functionals by forming a suitable linear combination of the latter. We recall that
a linear functional L on a function space $\mathcal{F}$ is a map

$$L: \mathcal{F} \to \mathbb{R}, \tag{3.62}$$

which satisfies the conditions of additivity and homogeneity:

$$L(f + g) = Lf + Lg, \qquad \text{all } f, g \in \mathcal{F}, \tag{3.63}$$

and

$$L(cf) = c\,Lf, \qquad \text{all } c \in \mathbb{R},\ f \in \mathcal{F}. \tag{3.64}$$

The function class $\mathcal{F}$, of course, must be a linear space, that is, closed under addition
and multiplication by a scalar: $f, g \in \mathcal{F}$ implies $f + g \in \mathcal{F}$, and $f \in \mathcal{F}$ implies
$cf \in \mathcal{F}$ for any $c \in \mathbb{R}$.

Here are some examples of linear functionals and appropriate spaces F on which 
they live: 


(a) $Lf = f(0)$; $\mathcal{F} = \{f : f$ is defined at $x = 0\}$.

(b) $Lf = f''(\tfrac12)$; $\mathcal{F} = \{f : f$ has a second derivative at $x = \tfrac12\}$.

(c) $Lf = \int_0^1 f(x)\,dx$; $\mathcal{F} = C[0,1]$ (or, more generally, f is Riemann integrable,
or Lebesgue integrable).

(d) $Lf = \int_0^1 f(x) w(x)\,dx$, where w is a given (integrable) "weight function"; $\mathcal{F} = C[0,1]$.

(e) Any linear combination (with constant coefficients) of the preceding linear
functionals.


Examples of nonlinear functionals are

(a') $Kf = |f(0)|$.

(b') $Kf = \int_0^1 [f(x)]^2\,dx$, and so on.


We are now ready to formulate the general approximation problem: given a linear
functional L on $\mathcal{F}$ (to be approximated), n special linear functionals $L_1, L_2, \ldots, L_n$
on $\mathcal{F}$ and their values (the "data") $\ell_i = L_i f$, $i = 1, 2, \ldots, n$, applied to some
function f, and given a linear subspace $\Phi \subset \mathcal{F}$ with $\dim \Phi = n$, we want to find an
approximation formula of the type

$$Lf \approx \sum_{i=1}^{n} a_i L_i f \tag{3.65}$$

that is exact (i.e., holds with equality) whenever $f \in \Phi$.
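It is natural (since we want to "interpolate") to make the following
assumption: the "interpolation problem"

$$\text{find } \varphi \in \Phi \text{ such that } L_i \varphi = s_i, \quad i = 1, 2, \ldots, n, \tag{3.66}$$

has a unique solution, $\varphi(\cdot) = \varphi(s; \cdot)$, for arbitrary $s = [s_1, s_2, \ldots, s_n]^T$.
We can express our assumption more explicitly in terms of a given basis
$\varphi_1, \varphi_2, \ldots, \varphi_n$ of $\Phi$ and the associated "Gram matrix"

$$G = [L_i \varphi_j] = \begin{bmatrix} L_1\varphi_1 & L_1\varphi_2 & \cdots & L_1\varphi_n \\ L_2\varphi_1 & L_2\varphi_2 & \cdots & L_2\varphi_n \\ \vdots & \vdots & & \vdots \\ L_n\varphi_1 & L_n\varphi_2 & \cdots & L_n\varphi_n \end{bmatrix} \in \mathbb{R}^{n \times n}. \tag{3.67}$$

What we require is that

$$\det G \ne 0. \tag{3.68}$$

It is easily seen (cf. Ex. 49) that this condition is independent of the particular choice
of basis. To show that unique solvability of (3.66) and (3.68) are equivalent, we
express $\varphi$ in (3.66) as a linear combination of the basis functions,

$$\varphi = \sum_{j=1}^{n} c_j \varphi_j, \tag{3.69}$$

and note that the interpolation conditions $L_i\bigl(\sum_{j=1}^{n} c_j \varphi_j\bigr) = s_i$,
by the linearity of the functionals $L_i$, can be written in the form

$$\sum_{j=1}^{n} c_j L_i \varphi_j = s_i, \qquad i = 1, 2, \ldots, n;$$

that is,

$$Gc = s, \qquad c = [c_1, c_2, \ldots, c_n]^T, \quad s = [s_1, s_2, \ldots, s_n]^T. \tag{3.70}$$

This has a unique solution for arbitrary s if and only if (3.68) holds.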
Method of interpolation. We solve the general approximation problem "by
interpolation,"

$$Lf \approx L\varphi(\ell; \cdot), \qquad \ell = [\ell_1, \ell_2, \ldots, \ell_n]^T, \quad \ell_i = L_i f. \tag{3.71}$$

In other words, we apply L not to f, but to $\varphi(\ell; \cdot)$, the solution of the interpolation
problem (3.66) in which $s = \ell$, the given "data." Our assumption guarantees that
$\varphi(\ell; \cdot)$ is uniquely determined. In particular, if $f \in \Phi$, then (3.71) holds with
equality, since trivially $\varphi(\ell; \cdot) = f(\cdot)$ in this case. Thus, our approximation (3.71)
already satisfies the exactness condition required for (3.65). It remains only to show
that (3.71) indeed produces an approximation of the form (3.65). To do so, observe
that the interpolant in (3.71) is

$$\varphi(\ell; \cdot) = \sum_{j=1}^{n} c_j \varphi_j(\cdot),$$

where the vector $c = [c_1, c_2, \ldots, c_n]^T$ satisfies (3.70) with $s = \ell$,

$$Gc = \ell, \qquad \ell = [L_1 f, L_2 f, \ldots, L_n f]^T.$$

Writing

$$\lambda_j = L\varphi_j, \quad j = 1, 2, \ldots, n; \qquad \lambda = [\lambda_1, \lambda_2, \ldots, \lambda_n]^T, \tag{3.72}$$

we have by the linearity of L,

$$L\varphi(\ell; \cdot) = \sum_{j=1}^{n} c_j L\varphi_j = \lambda^T c = \lambda^T G^{-1} \ell = \bigl[(G^T)^{-1}\lambda\bigr]^T \ell;$$

that is,

$$L\varphi(\ell; \cdot) = \sum_{i=1}^{n} a_i L_i f, \qquad a = [a_1, a_2, \ldots, a_n]^T = (G^T)^{-1}\lambda. \tag{3.73}$$

Method of undetermined coefficients. Here, we determine the coefficients $a_i$ in
(3.65) such that equality holds for all $f \in \Phi$, which, by the linearity of both L and
$L_i$, is equivalent to equality for $f = \varphi_1$, $f = \varphi_2$, ..., $f = \varphi_n$; that is,

$$\Bigl(\sum_{j=1}^{n} a_j L_j\Bigr)\varphi_i = L\varphi_i, \qquad i = 1, 2, \ldots, n,$$

or, by (3.72),

$$\sum_{j=1}^{n} a_j L_j \varphi_i = \lambda_i, \qquad i = 1, 2, \ldots, n.$$

Evidently, the matrix of this system is $G^T$, so that

$$a = [a_1, a_2, \ldots, a_n]^T = (G^T)^{-1}\lambda,$$




in agreement with (3.73). Thus, the method of interpolation and the method of 
undetermined coefficients are mathematically equivalent — they produce exactly the 
same approximation. 
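
The system $G^T a = \lambda$ can be set up mechanically once the functionals and a basis are given. The following is a minimal sketch under stated assumptions: polynomials are represented by coefficient tuples $(c_0, c_1, \ldots)$, functionals are ordinary Python functions acting on such tuples, and all helper names (`solve`, `undetermined_coefficients`, `value_at_0`, and so on) are hypothetical. Exact rational arithmetic is used so that the weights of (3.58) and (3.61) are reproduced exactly.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


# Linear functionals acting on p(x) = c[0] + c[1] x + ... (coefficient tuples).
def value_at_0(c):
    return Fraction(c[0])

def value_at_1(c):
    return sum(Fraction(cj) for cj in c)

def integral_01(c):                      # integral of p over [0, 1]
    return sum(Fraction(cj, j + 1) for j, cj in enumerate(c))

def sqrt_weighted_integral(c):           # integral of sqrt(x) p(x) over [0, 1]
    return sum(Fraction(2 * cj, 2 * j + 3) for j, cj in enumerate(c))


def undetermined_coefficients(L, Ls, basis):
    """Solve G^T a = lambda, where row i reads sum_j a_j L_j(phi_i) = L(phi_i)."""
    GT = [[Lj(phi) for Lj in Ls] for phi in basis]
    lam = [L(phi) for phi in basis]
    return solve(GT, lam)


basis = [(1, 0), (0, 1)]                 # phi_1 = 1, phi_2 = x
trapezoid = undetermined_coefficients(integral_01, [value_at_0, value_at_1], basis)
weighted = undetermined_coefficients(sqrt_weighted_integral, [value_at_0, integral_01], basis)
print(trapezoid)   # weights of (3.58)
print(weighted)    # weights of (3.61)
```

The elimination step raises `StopIteration` when no nonzero pivot is found, which corresponds to a violation of condition (3.68).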


3.2.6 Peano Representation of Linear Functionals 


It may be argued that the method of interpolation, at least in the case of polynomials
(i.e., $\Phi = \mathbb{P}_d$), is more powerful than the method of undetermined coefficients
because it also yields an expression for the error term (if we carry along the
remainder term of interpolation). The method of undetermined coefficients, in
contrast, generates only the coefficients in the approximation and gives no clue as
to the approximation error.

There is, however, a device due to Peano¹¹ that allows us to discuss the error after
the approximation has been found. The point is that the error,

$$Ef = Lf - \sum_{i=1}^{n} a_i L_i f, \tag{3.74}$$

is itself a linear functional, one that annihilates all polynomials, say, of degree $\le d$,

$$Ep = 0, \qquad \text{all } p \in \mathbb{P}_d. \tag{3.75}$$

Now suppose that $\mathcal{F}$ consists of all functions f having a continuous (r + 1)st
derivative on the finite interval [a, b], $\mathcal{F} = C^{r+1}[a,b]$, $r \le d$. Then by Taylor's
theorem with the remainder in integral form, we have

$$f(x) = f(a) + \frac{x-a}{1!} f'(a) + \cdots + \frac{(x-a)^r}{r!} f^{(r)}(a) + \frac{1}{r!}\int_a^x (x-t)^r f^{(r+1)}(t)\,dt. \tag{3.76}$$

The last integral can be extended to t = b if we replace x - t by 0 when t > x:

$$(x-t)_+ = \begin{cases} x - t & \text{if } x - t \ge 0, \\ 0 & \text{if } x - t < 0. \end{cases}$$


¹¹ Giuseppe Peano (1858-1932), an Italian mathematician active in Turin, made fundamental
contributions to mathematical logic, set theory, and the foundations of mathematics. General 
existence theorems in ordinary differential equations also bear his name. He created his own 
mathematical language, using symbols of the algebra of logic, and even promoted (and used) a 
simplified Latin (his “latino”) as a world language for scientific publication. 
Thus,

$$\int_a^x (x-t)^r f^{(r+1)}(t)\,dt = \int_a^b (x-t)_+^r f^{(r+1)}(t)\,dt. \tag{3.77}$$

Applying the linear functional E to both sides of (3.76), with the integral
written as in (3.77), yields by linearity of E and (3.75)

$$Ef = \frac{1}{r!}\, E\left\{\int_a^b (x-t)_+^r f^{(r+1)}(t)\,dt\right\} = \frac{1}{r!}\int_a^b \bigl[E_{(x)}(x-t)_+^r\bigr] f^{(r+1)}(t)\,dt,$$

provided the interchange of E with the integral is legitimate. (For most functionals
it is.) The subscript x in $E_{(x)}$ is to indicate that E acts on the variable x (not t).
Defining

$$K_r(t) = \frac{1}{r!}\, E_{(x)}(x-t)_+^r, \qquad t \in \mathbb{R}, \tag{3.78}$$

we thus have the following representation for the error E,

$$Ef = \int_a^b K_r(t) f^{(r+1)}(t)\,dt. \tag{3.79}$$

This is called the Peano representation of the functional E, and $K_r$ the rth Peano
kernel for E.

If the functional E makes reference only to values of x in [a, b] (e.g., Ef may
involve values of f or of a derivative of f at some points in [a, b], or integration
over [a, b]), then it follows from (3.78) that $K_r(t) = 0$ for $t \notin [a,b]$ (cf. Ex. 50). In
this case, the integral in (3.79) can be extended over the whole real line.

The functional E is called definite of order r if its Peano kernel $K_r$ does not
change sign. (We then also say that $K_r$ is definite.) For such functionals E, we can use the
Mean Value Theorem of integration to write (3.79) in the form

$$Ef = f^{(r+1)}(\tau)\int_a^b K_r(t)\,dt, \qquad a < \tau < b \quad (E \text{ definite of order } r).$$

The integral on the right is easily evaluated by putting $f(t) = t^{r+1}/(r+1)!$ in
(3.79). This gives

$$Ef = e_{r+1}\, f^{(r+1)}(\tau), \qquad e_{r+1} = E\,\frac{t^{r+1}}{(r+1)!} \qquad (E \text{ definite of order } r). \tag{3.80}$$

Since $e_{r+1} \ne 0$ by definiteness of $K_r$, we must have r = d by virtue of (3.75),
and so

$$|Ef| \le |e_{d+1}|\, \|f^{(d+1)}\|_\infty. \tag{3.81}$$
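Conversely, a functional E satisfying (3.80) with $e_{r+1} \ne 0$ is necessarily definite of
order r (see Ex. 51). For nondefinite functionals E, we must estimate by

$$|Ef| \le \|f^{(r+1)}\|_\infty \int_a^b |K_r(t)|\,dt, \tag{3.82}$$

which, in view of the form (3.78) of $K_r$, can be rather laborious.
As an example, consider the formula obtained in (3.61); here

$$Ef = \int_0^1 \sqrt{x}\, f(x)\,dx + \tfrac{2}{15}\, f(0) - \tfrac45 \int_0^1 f(x)\,dx,$$

and Ef = 0 for all $f \in \mathbb{P}_1$ (d = 1). Assuming that $f \in C^2[0,1]$ (r = d = 1), we
thus have by (3.79),

$$Ef = \int_0^1 K_1(t) f''(t)\,dt, \qquad K_1(t) = E_{(x)}(x-t)_+.$$

Furthermore, for $0 \le t \le 1$ (where $(0 - t)_+ = 0$),

$$K_1(t) = \int_0^1 \sqrt{x}\,(x-t)_+\,dx - \tfrac45 \int_0^1 (x-t)_+\,dx = \int_t^1 \sqrt{x}\,(x-t)\,dx - \tfrac45 \int_t^1 (x-t)\,dx$$

$$= \tfrac25\, x^{5/2}\Big|_t^1 - \tfrac23\, t\, x^{3/2}\Big|_t^1 - \tfrac25 (1-t)^2 = \tfrac25\bigl(1 - t^{5/2}\bigr) - \tfrac23\, t\bigl(1 - t^{3/2}\bigr) - \tfrac25 (1-t)^2$$

$$= \tfrac{4}{15}\, t^{5/2} - \tfrac25\, t^2 + \tfrac{2}{15}\, t = \tfrac{2}{15}\, t\,\bigl(2 t^{3/2} - 3t + 1\bigr).$$

Now the function in parentheses, say, q(t), satisfies q(0) = 1, q(1) = 0, $q'(t) =
-3(1 - t^{1/2}) < 0$ for 0 < t < 1. There follows $q(t) \ge 0$ on [0, 1], and the kernel $K_1$
is (positive) definite. Furthermore,

$$e_2 = E\left(\frac{x^2}{2}\right) = \frac12\left\{\int_0^1 x^{5/2}\,dx - \frac45\int_0^1 x^2\,dx\right\} = \frac12\left\{\frac27 - \frac{4}{15}\right\} = \frac17 - \frac{2}{15} = \frac{1}{105},$$

so that finally, by (3.80),

$$Ef = \frac{1}{105}\, f''(\tau), \qquad 0 < \tau < 1.$$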

3.2.7 Extrapolation Methods 


Many methods of approximation depend on a positive parameter, say, h, which
controls the accuracy of the method. As $h \downarrow 0$, the approximations typically
converge to the exact solution. An example of this is the composite trapezoidal rule
of integration, where h is the spacing of the quadrature nodes. Other important
examples are finite difference methods in ordinary and partial differential equations. In
practice, one usually computes several approximations to a solution, corresponding
to different values of the parameter h. It is then natural to try "extrapolating to the
limit h = 0," that is, constructing a linear combination of these approximations that
is more accurate than any of them. This is the basic idea behind extrapolation
methods. We apply it here to the composite trapezoidal rule, which gives rise to an
interesting and powerful integration technique known as Romberg integration.

To develop the general principle, suppose the approximation in question is A(h),
a scalar-valued function of h, and thus A(0) the exact solution. The approximation
may be defined only for a set of discrete values of h, which, however, have h = 0
as a limit point (e.g., h = (b - a)/n, n = 1, 2, 3, ..., in the case of the composite
trapezoidal rule). We call these admissible values of h. When in the following we
write $h \to 0$, we mean that h goes to zero over these admissible values. About the
approximation A(h), we first assume that there exist constants $a_0$, $a_1$ independent
of h, and two positive numbers p, p' with p' > p, such that

$$A(h) = a_0 + a_1 h^{p} + O(h^{p'}), \qquad h \to 0, \quad a_1 \ne 0. \tag{3.83}$$

The order term here has the usual meaning of a quantity bounded (for all sufficiently
small h) by a constant times $h^{p'}$, where the constant does not depend on h. We only
assume the existence of such constants $a_0$, $a_1$; their values are usually not known.
Indeed, $a_0 = A(0)$, for example, is the exact solution. The value p, on the other
hand, is assumed to be known.

Now let q < 1 be a fixed positive number, and h and $q^{-1}h$ admissible parameters.
Then we have

$$A(h) = a_0 + a_1 h^{p} + O(h^{p'}),$$
$$A(q^{-1}h) = a_0 + a_1 q^{-p} h^{p} + O(h^{p'}), \qquad h \to 0.$$

Eliminating the middle terms on the right, we find

$$a_0 = A(0) = A(h) + \frac{A(h) - A(q^{-1}h)}{q^{-p} - 1} + O(h^{p'}), \qquad h \to 0.$$



Thus, from two approximations, A(h), $A(q^{-1}h)$, whose errors are both $O(h^{p})$, we
obtain an improved approximation,

$$A_{\mathrm{impr}}(h) = A(h) + \frac{A(h) - A(q^{-1}h)}{q^{-p} - 1}, \tag{3.84}$$

with a smaller error of $O(h^{p'})$. The passage from A to $A_{\mathrm{impr}}$ is called Richardson¹²
extrapolation.
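
A single extrapolation step of the form (3.84) is easy to try out. The sketch below is an illustration under assumptions not made in the text: A(h) is taken to be the symmetric difference quotient for a first derivative, whose error expansion proceeds in even powers of h (so p = 2, p' = 4), and the names `central_difference` and `richardson` are hypothetical.

```python
import math


def central_difference(f, x, h):
    """A(h) = (f(x+h) - f(x-h)) / (2h) = f'(x) + a_1 h^2 + O(h^4)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def richardson(A, h, q, p):
    """One step of (3.84): combine A(h) and A(h/q) so that the h^p term cancels."""
    Ah = A(h)
    return Ah + (Ah - A(h / q)) / (q ** (-p) - 1.0)


A = lambda h: central_difference(math.exp, 0.0, h)   # exact value A(0) = exp'(0) = 1
h, q, p = 0.1, 0.5, 2
plain_error = abs(A(h) - 1.0)
improved_error = abs(richardson(A, h, q, p) - 1.0)
print(plain_error, improved_error)
```

With q = 1/2 the second evaluation uses the step 2h, so the improved value costs one additional evaluation of A.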

If we want to repeat this process, we have to know more about the approximation
A(h); there must be more terms to be eliminated. This leads to the idea of an
asymptotic expansion: we say that A(h) admits an asymptotic expansion

$$A(h) = a_0 + a_1 h^{p_1} + a_2 h^{p_2} + \cdots, \qquad 0 < p_1 < p_2 < \cdots, \quad h \to 0 \tag{3.85}$$

(with coefficients $a_i$ independent of h), if for each k = 1, 2, ...

$$A(h) - \bigl(a_0 + a_1 h^{p_1} + \cdots + a_k h^{p_k}\bigr) = O(h^{p_{k+1}}), \qquad h \to 0. \tag{3.86}$$

We emphasize that (3.85) need not (and in fact usually does not) converge for any
fixed h > 0; all that is required is (3.86). If (3.86) holds only for finitely many k,
say, k = 1, 2, ..., K, then the expansion (3.85) is finite and is referred to as an
asymptotic approximation to K + 1 terms.

It is now clear that if A(h) admits an asymptotic expansion (or approximation),
we can successively eliminate the terms of the expansion exactly as we did
for a 2-term approximation, thereby obtaining a (finite or infinite) sequence of
successively improved approximations. We formulate this in the form of a theorem.


Theorem 3.2.2. (Repeated Richardson extrapolation). Let A(h) admit the asymptotic
expansion (3.85) and define, for some fixed positive q < 1,

$$A_1(h) = A(h),$$

$$A_{k+1}(h) = A_k(h) + \frac{A_k(h) - A_k(q^{-1}h)}{q^{-p_k} - 1}, \qquad k = 1, 2, \ldots. \tag{3.87}$$


¹² Lewis Fry Richardson (1881-1953), born, educated, and active in England, did pioneering work
in numerical weather prediction, proposing to solve the hydrodynamical and thermodynamical 
equations of meteorology by finite difference methods. Although this was the precomputer age, 
Richardson envisaged that the job could be done “with tier upon tier of human computers fitted 
into an Albert Hall structure” (P.S. Sheppard in Nature, vol. 172, 1953, p. 1127). He also did a 
penetrating study of atmospheric turbulence, where a nondimensional quantity introduced by him 
is now called “Richardson’s number.” At the age of 50, he earned a degree in psychology and 
began to develop a scientific theory of international relations. He was elected a Fellow of the Royal 
Society in 1926. 
A_{0,0}
A_{1,0}  A_{1,1}
A_{2,0}  A_{2,1}  A_{2,2}
A_{3,0}  A_{3,1}  A_{3,2}  A_{3,3}

Fig. 3.3 Extrapolation algorithm (triangular scheme with computing stencil)


Then for each n = 1, 2, 3, ..., $A_n(h)$ admits an asymptotic expansion

$$A_n(h) = a_0 + a_n^{(n)} h^{p_n} + a_{n+1}^{(n)} h^{p_{n+1}} + \cdots, \qquad h \to 0, \tag{3.88}$$

with certain coefficients $a_n^{(n)}, a_{n+1}^{(n)}, \ldots$ not depending on h.


We remark that if (3.85) is only an approximation to K + 1 terms, then the
recursion (3.87) is applicable only for k = 1, 2, ..., K, and (3.88) holds for n =
1, 2, ..., K, whereas for n = K + 1 one has $A_{K+1}(h) = a_0 + O(h^{p_{K+1}})$.

It is easily seen from (3.87) that $A_{k+1}(h)$ is a linear combination of A(h),
$A(q^{-1}h)$, $A(q^{-2}h)$, ..., $A(q^{-k}h)$, where it was tacitly assumed that h, $q^{-1}h$,
$q^{-2}h$, ... are admissible values of the parameter.

We now rework Theorem 3.2.2 into a practical algorithm. To do so, we assume
that we initially compute A(h) for a succession of parameter values

$$h_0, \quad q h_0, \quad q^2 h_0, \quad \ldots \qquad (q < 1),$$

all being admissible. Then we define

$$A_{m,k} = A_{k+1}(q^m h_0), \qquad m, k = 0, 1, 2, \ldots. \tag{3.89}$$

The idea behind (3.89) is to provide two mechanisms for improving the accuracy:
one is to increase m, which reduces the parameter h, the other is increasing k, which
engages a more accurate approximation. Ideally, one employs both mechanisms
simultaneously, which suggests that the diagonal entries $A_{m,m}$ are the ones of most
interest.

Putting $h = q^m h_0$ in (3.87) produces the extrapolation algorithm

$$A_{m,k} = A_{m,k-1} + \frac{A_{m,k-1} - A_{m-1,k-1}}{q^{-p_k} - 1}, \qquad m \ge k \ge 1,$$

$$A_{m,0} = A(q^m h_0). \tag{3.90}$$


This allows us to compute the triangular scheme of approximations shown in 
Fig. 3.3. 
Each entry is computed in terms of its neighbors horizontally, and diagonally above,
to the left, as indicated in the computing stencil of Fig. 3.3. The entries in the first 
column are the approximations initially computed. The generation of the triangular 
scheme, once the first column has been computed, is extremely cheap, yet can 
dramatically improve the accuracy, especially down the diagonal. If (3.85) is a finite 
asymptotic approximation to K +1 terms, then the array in Fig. 3.3 has a trapezoidal 
shape, consisting of K + 1 columns (including the one with k = 0). 
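
The triangular scheme generated by (3.90) can be sketched in a few lines. The example below is an illustration only: the function name `extrapolation_table` is hypothetical, and the test approximation is the forward difference quotient of $e^x$ at x = 0, whose expansion in powers of h has exponents $p_k = k$ (an assumption specific to this example, not to the text).

```python
import math


def extrapolation_table(A, h0, q, exponents, rows):
    """Triangular scheme (3.90): T[m][k] = A_{m,k}, with exponents[k-1] = p_k."""
    T = []
    for m in range(rows):
        row = [A(q ** m * h0)]                      # A_{m,0} = A(q^m h0)
        for k in range(1, m + 1):
            denom = q ** (-exponents[k - 1]) - 1.0
            # Stencil: left neighbor row[k-1] and the entry diagonally above, T[m-1][k-1].
            row.append(row[k - 1] + (row[k - 1] - T[m - 1][k - 1]) / denom)
        T.append(row)
    return T


# Forward difference quotient for exp at x = 0: A(h) = 1 + h/2! + h^2/3! + ...,
# so p_k = k in (3.85) and the exact value is A(0) = 1.
A = lambda h: (math.exp(h) - 1.0) / h
T = extrapolation_table(A, h0=0.4, q=0.5, exponents=[1, 2, 3], rows=4)
for row in T:
    print(["%.8f" % v for v in row])
```

Only the first column requires evaluations of A; all other entries follow from the recursion, and the error is expected to shrink most rapidly along the diagonal.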

We now apply (3.90) and Theorem 3.2.2 to the case of the composite trapezoidal
rule,

$$a_0 = \int_a^b f(x)\,dx, \tag{3.91}$$

$$A(h) = h\left\{\tfrac12 f(a) + \sum_{k=1}^{n-1} f(a + kh) + \tfrac12 f(b)\right\}, \qquad h = \frac{b-a}{n}. \tag{3.92}$$

The development of an asymptotic expansion for A(h) in (3.92) is far from trivial.
In fact, the result is the content of a well-known formula due to Euler¹³ and
Maclaurin.¹⁴

Before stating it, we need to define the Bernoulli¹⁵ numbers $B_k$; these are the
coefficients in the expansion

$$\frac{z}{e^z - 1} = \sum_{k=0}^{\infty} \frac{B_k}{k!}\, z^k, \qquad |z| < 2\pi. \tag{3.93}$$

¹³ Leonhard Euler (1707-1783) was the son of a minister interested in mathematics who followed
lectures of Jakob Bernoulli at the University of Basel. Euler himself was allowed to see Johann
Bernoulli on Saturday afternoons for private tutoring. At the age of 20, after he was unsuccessful
in obtaining a professorship in physics at the University of Basel, anecdotally because of a lottery
system then in use (Euler lost), he emigrated to St. Petersburg; later, he moved on to Berlin, and 
then back again to St. Petersburg. Euler unquestionably was the most prolific mathematician of 
the 18th century, working in virtually all branches of the differential and integral calculus and, in 
particular, being one of the founders of the calculus of variations. He also did pioneering work in 
the applied sciences, notably hydrodynamics, mechanics of deformable materials and rigid bodies, 
optics, astronomy, and the theory of the spinning top. Not even his blindness at the age of 59 
managed to break his phenomenal productivity. Euler’s collected works are still being edited, 71 
volumes having already been published. 


¹⁴ Colin Maclaurin (1698-1764) was a Scottish mathematician who applied the new infinitesimal
calculus to various problems in geometry. He is best known for his power series expansion, but
also contributed to the theory of equations.

¹⁵ Jakob Bernoulli (1654-1705), the elder brother of Johann Bernoulli, was active in Basel. He was
one of the first to appreciate Leibniz’s and Newton’s differential and integral calculus and enriched 
it by many original contributions of his own, often in (not always amicable) competition with his 
younger brother. He is also known in probability theory for his “law of large numbers.” 
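It is known that

$$B_0 = 1, \quad B_2 = \tfrac16, \quad B_4 = -\tfrac{1}{30}, \quad B_6 = \tfrac{1}{42}, \ \ldots, \qquad B_1 = -\tfrac12, \quad B_3 = B_5 = \cdots = 0. \tag{3.94}$$

Furthermore,

$$|B_k| \sim 2\,\frac{k!}{(2\pi)^k} \qquad \text{as } k \text{ (even)} \to \infty. \tag{3.95}$$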
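
The values in (3.94) and the growth rate (3.95) can be reproduced with exact rational arithmetic. The sketch below uses the standard recurrence $\sum_{j=0}^{m} \binom{m+1}{j} B_j = 0$ for $m \ge 1$, which follows from multiplying the series (3.93) by $(e^z - 1)/z$; the recurrence and the function name `bernoulli_numbers` are not taken from the text.

```python
from fractions import Fraction
from math import comb, factorial, pi


def bernoulli_numbers(n):
    """Return [B_0, ..., B_n] from sum_{j=0}^{m} C(m+1, j) B_j = 0, m >= 1, with B_0 = 1."""
    B = [Fraction(1)]
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * B[j] for j in range(m))
        B.append(-s / (m + 1))           # the coefficient of B_m is C(m+1, m) = m + 1
    return B


B = bernoulli_numbers(20)
print(B[1], B[2], B[4], B[6], B[3], B[5])

# Ratio |B_k| / (2 k! / (2 pi)^k), which by (3.95) should approach 1 for even k.
ratios = {k: abs(float(B[k])) * (2 * pi) ** k / (2 * factorial(k)) for k in (4, 10, 20)}
print(ratios)
```

The rapid growth of $|B_k|$ visible here is consistent with the remark that expansions built on these numbers are usually asymptotic rather than convergent.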


We now state without proof¹⁶


Theorem 3.2.3. (Euler-Maclaurin formula). Let A(h) be defined by (3.92), where
$f \in C^{2K+1}[a,b]$ for some integer $K \ge 1$. Then

$$A(h) = a_0 + a_1 h^2 + a_2 h^4 + \cdots + a_K h^{2K} + O(h^{2K+1}), \qquad h \to 0, \tag{3.96}$$


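¹⁶ A heuristic derivation of the formal expansion (3.96) and (3.97), very much in the spirit of Euler,
may proceed as follows. We start from Taylor's expansion (where $x, x + h \in [a,b]$)

$$f(x+h) - f(x) = \sum_{k=1}^{\infty} \frac{h^k D^k}{k!}\, f(x) = \bigl(e^{hD} - 1\bigr) f(x), \qquad D = \frac{d}{dx}.$$

Solving formally for f(x), we get, using (3.93),

$$f(x) = \bigl(e^{hD} - 1\bigr)^{-1}\,[f(x+h) - f(x)] = \sum_{r=0}^{\infty} \frac{B_r}{r!}\,(hD)^{r-1}\,[f(x+h) - f(x)];$$

that is, in view of (3.94),

$$f(x) = \left[(hD)^{-1} - \tfrac12 + \sum_{r=2}^{\infty} \frac{B_r}{r!}\,(hD)^{r-1}\right][f(x+h) - f(x)]$$

$$= (hD)^{-1}[f(x+h) - f(x)] - \tfrac12\,[f(x+h) - f(x)] + \sum_{k=1}^{\infty} \frac{B_{2k}}{(2k)!}\,(hD)^{2k-1}\,[f(x+h) - f(x)]$$

$$= \frac{1}{h}\int_x^{x+h} f(t)\,dt - \tfrac12\,[f(x+h) - f(x)] + \sum_{k=1}^{\infty} \frac{B_{2k}}{(2k)!}\, h^{2k-1}\,\bigl[f^{(2k-1)}(x+h) - f^{(2k-1)}(x)\bigr].$$

Therefore, bringing the second term to the left-hand side, and multiplying through by h,

$$\frac{h}{2}\,[f(x+h) + f(x)] = \int_x^{x+h} f(t)\,dt + \sum_{k=1}^{\infty} \frac{B_{2k}}{(2k)!}\, h^{2k}\,\bigl[f^{(2k-1)}(x+h) - f^{(2k-1)}(x)\bigr].$$

Now letting x = a + ih and summing over i from 0 to n - 1 gives

$$A(h) = \int_a^b f(t)\,dt + \sum_{k=1}^{\infty} \frac{B_{2k}}{(2k)!}\, h^{2k}\,\bigl[f^{(2k-1)}(b) - f^{(2k-1)}(a)\bigr].$$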
where $a_0$ is given by (3.91) and

$$a_k = \frac{B_{2k}}{(2k)!}\,\bigl[f^{(2k-1)}(b) - f^{(2k-1)}(a)\bigr], \qquad k = 1, 2, \ldots, K. \tag{3.97}$$


Thus, under the assumption of Theorem 3.2.3, we have in (3.96) an asymptotic
approximation to K + 1 terms for A(h), with

$$p_k = 2k, \quad k = 1, 2, \ldots, K, \qquad p_{K+1} = 2K + 1, \tag{3.98}$$

and an asymptotic expansion, if $K = \infty$ (i.e., if f has continuous derivatives of
any order). Choosing $h_0 = b - a$, $q = \tfrac12$ in (3.89), which is certainly permissible,
the scheme (3.90) becomes

$$A_{m,k} = A_{m,k-1} + \frac{A_{m,k-1} - A_{m-1,k-1}}{4^k - 1}, \qquad m \ge k \ge 1,$$

and is known as the Romberg integration scheme. The choice $q = \tfrac12$ is particularly
convenient here, since in the generation of $A_{m,0}$ we can reuse function values already
computed (cf. MA 6).
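
A compact sketch of the Romberg scheme follows. It is not the program asked for in MA 6, only an illustration under the choices $h_0 = b - a$, $q = \tfrac12$; the function name `romberg` and the test integrand are assumptions. Each new trapezoidal sum reuses the previous one and evaluates f only at the newly added midpoints.

```python
import math


def romberg(f, a, b, rows):
    """Romberg table T[m][k] = A_{m,k} for the composite trapezoidal rule with q = 1/2."""
    h = b - a
    T = [[0.5 * h * (f(a) + f(b))]]               # A_{0,0}: a single subinterval
    for m in range(1, rows):
        h *= 0.5
        # The new nodes are the midpoints of the previous subintervals; the values at
        # the old nodes enter through the previous trapezoidal sum T[m-1][0].
        new_points = sum(f(a + (2 * i + 1) * h) for i in range(2 ** (m - 1)))
        row = [0.5 * T[m - 1][0] + h * new_points]
        for k in range(1, m + 1):
            row.append(row[k - 1] + (row[k - 1] - T[m - 1][k - 1]) / (4 ** k - 1))
        T.append(row)
    return T


T = romberg(math.exp, 0.0, 1.0, rows=5)
exact = math.e - 1.0
first_column_error = abs(T[4][0] - exact)
diagonal_error = abs(T[4][4] - exact)
print(first_column_error, diagonal_error)
```

In this example the table uses 17 function values in total, and the diagonal entry is far more accurate than the trapezoidal sum computed from the same values.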

There is an important instance in which the application of Romberg integration
would be pointless, namely, when all coefficients $a_1, a_2, \ldots$ in (3.96) are zero.
This is the case when f is periodic with period b - a, and smooth on $\mathbb{R}$. Indeed,
we already know that the composite trapezoidal rule is then exceptionally accurate
(cf. Sect. 3.2.1, (3.32)).


3.3 Notes to Chapter 3 


Section 3.1. Here we are dealing strictly with numerical differentiation, that is, 
with the problem of obtaining approximations to derivatives that can be used for 
numerical evaluation. The problem of symbolic differentiation, where the goal is 
to obtain analytic expressions for derivatives of functions given in analytic form, is 
handled by most computer algebra systems such as Mathematica and Macsyma, 
and we refer to texts in this area cited in Sect.P3.2 under Computer Algebra. 
Another important approach to differentiation is what is referred to as automatic 
differentiation. Here, the objective is to create a program (i.e., a piece of software) 
for computing the derivatives of a function given in the form of a program or 
algorithm. Notable applications are to optimization (calculation of Jacobian and 
Hessian matrices) and to the solution of ordinary differential equations by Taylor 
expansion. For an early paper on this subject, see Kedem [1980], for a good 
cross-section of current activity, Griewank and Corliss [1991] and a more recent 
exposition, Griewank and Walther [2008]. Automatic differentiation in the context 
of Matlab object-oriented programming is discussed in Neidinger [2010]. 
Section 3.1.1. The interpolatory formulae for differentiation derived here are
analogous to the Newton-Cotes formulae for integration (cf. Sect. 3.2.2) in the sense
that one fixes n + 1 distinct nodes $x_0, x_1, \ldots, x_n$ and interpolates on them by a
polynomial of degree n to ensure polynomial degree of exactness n. In analogy
to Gauss quadrature formulae, one might well ask whether by a suitable choice of
nodes one could substantially increase the degree of precision. If $x_0$ is the point at
which f' is to be approximated, there is actually no compelling reason for including
it among the points where f is evaluated. One may thus consider approximating
$f'(x_0)$ by $L(f; x_0, h) = h^{-1}\sum_{i=0}^{n} w_i f(x_0 + t_i h)$ and choosing the (real) numbers
$w_i$, $t_i$ so that the integer d for which $Ef = f'(x_0) - L(f; x_0, h) = 0$, all $f \in \mathbb{P}_d$,
is as large as possible. This, however, is where the analogy with Gauss ends. For
one thing, the $t_i$ need to be normalized somehow, since multiplying all $t_i$ by a
constant c and dividing $w_i$ by c yields essentially the same formula. One way of
normalization is to require that $|t_i - t_j| \ge 1$ for all $i \ne j$. More important,
the possible improvement over d = n (which is achievable by interpolation) is
disappointing: the best we can do is obtain d = n + 1. This can be shown by a simple
matrix argument; see Ash et al. [1984]. Among the formulae with d = n + 1 (which
are not unique), one may define an optimal one that minimizes the absolute value
of the coefficient in the leading term of the truncation error Ef. These have been
derived for each n in Ash et al. [1984], not only for the first, but also for the second
derivative. They seem to be optimal also in the presence of noise (for n = 2, see Ash
and Jones [1981, Theorem 3.2.3]), but are still subject to the magnification of noise
as exemplified in (3.18). To strike a balance between errors due to truncation and
those due to noise, an appropriate step h may be found adaptively; see, for example,
Stepleman and Winarsky [1979] and Oliver [1980].

For the sth derivative and its approximation by a formula L as in the preceding
paragraph, where $h^{-1}$ is to be replaced by $h^{-s}$, one may alternatively wish to
minimize the "condition number" $\sum_{i=0}^{n} |w_i|$. Interestingly, if n and s have the same
parity, the optimum is achieved by the extreme points of the nth-degree Chebyshev
polynomial $T_n$; see Rivlin [1975] and Miel and Mooney [1985].

One can do better, especially for high-order derivatives, if one allows the $t_i$ and
$w_i$ to be complex and assumes analyticity for f. For the sth derivative, it is then
possible (cf. Lyness [1968]) to achieve degree of exactness n + s by choosing the
$t_i$ to be the n + 1 roots of unity; specifically, one applies the trapezoidal rule to
Cauchy's integral for the sth derivative (see (3.19) for s = 1). A more sophisticated
use of these trapezoidal sums is made in Lyness and Moler [1967]. For practical
implementations of these ideas, and algorithms, see Lyness and Sande [1971] and
Fornberg [1981].

Considering the derivative of a function f on some interval, say, [0, 1], as the
solution on this interval of the (trivial) integral equation $\int_0^x u(t)\,dt = f(x)$, one
can try to combat noise in the data by applying "Tikhonov regularization" to this
operator equation; this approach is studied, for example, in King and Murio [1986].


Section 3.2. The standard text on the numerical evaluation of integrals — simple 
as well as multiple — is Davis and Rabinowitz [2007]. It contains a valuable 
bibliography of over 1500 items. Other useful texts are Krylov [1962], Brass [1977],
Engels [1980], and Evans [1993], the last being more practical and application-oriented
than the others and containing a detailed discussion of oscillatory integrals.
A broadly based, but concise, text is Krommer and Ueberhuber [1998]. Most
quadrature rules are designed to integrate exactly polynomials up to some degree
d, that is, all solutions of the differential equation $u^{(d+1)} = 0$. One can generalize
and require the rules to be exact for all solutions of a linear homogeneous differential 
equation of order d + 1. This is the approach taken in the book by Ghizzetti and 
Ossicini [1970]. There are classes of quadrature rules, perhaps more of theoretical 
than practical interest, that are not covered in our text. One such concerns rules, first 
studied by Chebyshev, whose weights are all equal (which minimizes the effects of 
small random errors in the function values), or whose weights have small variance. 
Surveys on these are given in Gautschi [1976a] and Forster [1993]. Another class 
includes optimal quadrature formulae, which minimize, for prescribed or variable 
nodes, the supremum of the modulus of the remainder term taken over some suitable 
class of functions. For these, and their close relationship to “monosplines,” we refer 
to Nikol’skii [1988] or Levin and Girshovich [1979]. 

Symbolic integration is inherently more difficult than symbolic differentiation 
since integrals are often not expressible in analytic form, even if the integrand is an 
elementary function. A great amount of attention, however, has been given to the 
problem of automatic integration. Here, the user is required to specify the limits of 
integration, to provide a routine for evaluating the integrand function, and to indicate 
an error tolerance (absolute, relative, or mixed) and an upper bound for the number 
of function evaluations to be performed. The automatic integrator is then expected 
either to return an answer satisfying the user’s criteria, or to report that the error 
criterion could not be satisfied within the desired volume of computation. In the 
latter event, a user-friendly integrator will offer a best-possible estimate for the value 
of the integral along with an estimate of the error. A popular collection of automatic 
integrators is Quadpack (see Piessens et al. [1983]), and a good description of 
the internal workings of automatic integration routines can be found in Davis and 
Rabinowitz [2007, Chap. 6]. 

For the important (and difficult) problem of multiple integration and related 
computational tools, we must refer to special texts, for example, Stroud [1971], 
Mysovskikh [1981] (for readers familiar with Russian), Sloan and Joe [1994], and 
Sobolev and Vaskevich [1997]. An update to Stroud [1971] is available in Cools 
and Rabinowitz [1993]. Monte Carlo methods are widely used in statistical physics 
and finance to compute high-dimensional integrals; texts discussing these methods 
are Niederreiter [1992], Sobol’ [1994], Evans and Swartz [2000], and Kalos and 
Whitlock [2008]. 
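
As a minimal illustration of the Monte Carlo idea (not taken from any of the texts cited), the following sketch estimates an integral over the unit cube $[0,1]^d$ by averaging the integrand at pseudorandom points, and reports the usual standard-error estimate. The function name `monte_carlo` and its parameters are assumptions of this sketch.

```python
import math
import random

def monte_carlo(f, dim, n_samples, seed=0):
    """Plain Monte Carlo estimate of the integral of f over [0,1]^dim.

    Returns (estimate, standard_error); the mean and variance are
    accumulated in one pass (Welford's update).
    """
    rng = random.Random(seed)
    mean, m2 = 0.0, 0.0
    for k in range(1, n_samples + 1):
        x = [rng.random() for _ in range(dim)]
        y = f(x)
        d = y - mean
        mean += d / k
        m2 += d * (y - mean)
    variance = m2 / (n_samples - 1)
    return mean, math.sqrt(variance / n_samples)

# Test integrand: sum of squares in 10 dimensions, exact integral 10/3.
estimate, std_err = monte_carlo(lambda x: sum(t * t for t in x), 10, 20_000)
```

The standard error decreases like $N^{-1/2}$ independently of the dimension, which is what makes the approach attractive for high-dimensional integrals.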

In dealing with definite integrals, one should never lose sight of the many 
analytical tools available that may help in evaluating or approximating integrals. The 
reader will find the old, but still pertinent, essay of Abramowitz [1954] informative 
in this respect, and may also wish to consult Zwillinger [1992a]. 

Section 3.2.1. The result relating to (3.34), in a slightly different form, is proved 
in Davis and Rabinowitz [2007, p. 209]. Their proof carries over to integrals of the 
form $\int_{\mathbb{R}} f(x)\,dx$, if one first lets $b \to \infty$, and then $a \to -\infty$, in their proof. 


Section 3.2.2. The classical Newton—Cotes formulae (3.35) involve equally spaced 
nodes (on [—1, 1]) and weight function w = 1. They are useful only for relatively 
small values of $n$, since for large $n$ the weights become large and oscillatory in 
sign. Choosing Chebyshev points instead removes this obstacle and gives rise to 
the Fejér quadrature rule. Close relatives are the Filippi rule, which uses the local 
extreme points of $T_{n+1}$, and the Clenshaw–Curtis rule, which uses all extreme points 
(including $\pm 1$) of $T_{n-1}$ as nodes. All three quadrature rules have weights that can 
be explicitly expressed in terms of trigonometric functions, and they are all positive. 
The latter has been proved for the first two rules by Fejér [1933], and for the last 
by Imhof [1963]. Formulas with Chebyshev points of the third and fourth kind, 
with or without the endpoints, are studied in Notaris [1997], [1998]. Algorithms for 
accurately computing weighted Newton—Cotes formulae are discussed in Kautsky 
and Elhay [1982] and Gautschi [1997] (see also Gautschi [2011a, Sect. 4.1] for an 
improvement); for a computer program, see Elhay and Kautsky [1987]. 
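
The sign behavior of the weights can be checked directly. The sketch below (function names are assumptions of this illustration) computes the closed Newton–Cotes weights on $[-1,1]$ in exact rational arithmetic by integrating the Lagrange polynomials, and the weights of Fejér's rule at the Chebyshev points from the standard trigonometric formula $w_k = \frac{2}{n}\bigl[1 - 2\sum_{j=1}^{\lfloor n/2\rfloor} \frac{\cos 2j\theta_k}{4j^2-1}\bigr]$, $\theta_k = (2k-1)\pi/(2n)$.

```python
import math
from fractions import Fraction

def newton_cotes_weights(n):
    """Exact weights of the n-point closed Newton-Cotes rule on [-1, 1] (n >= 2)."""
    nodes = [Fraction(-1) + Fraction(2 * i, n - 1) for i in range(n)]
    weights = []
    for i, xi in enumerate(nodes):
        # Coefficients (lowest degree first) of the Lagrange polynomial l_i.
        coeffs = [Fraction(1)]
        for j, xj in enumerate(nodes):
            if j == i:
                continue
            denom = xi - xj
            new = [Fraction(0)] * (len(coeffs) + 1)
            for k, c in enumerate(coeffs):
                new[k] += -xj * c / denom      # constant part of (t - xj)/denom
                new[k + 1] += c / denom        # linear part of (t - xj)/denom
            coeffs = new
        # Integrate over [-1, 1]: odd powers vanish, t^k gives 2/(k+1) for even k.
        weights.append(sum(c * Fraction(2, k + 1)
                           for k, c in enumerate(coeffs) if k % 2 == 0))
    return weights

def fejer_weights(n):
    """Weights of Fejer's rule at the n Chebyshev points cos((2k-1)pi/(2n))."""
    w = []
    for k in range(1, n + 1):
        theta = (2 * k - 1) * math.pi / (2 * n)
        s = sum(math.cos(2 * j * theta) / (4 * j * j - 1)
                for j in range(1, n // 2 + 1))
        w.append(2.0 / n * (1.0 - 2.0 * s))
    return w

has_negative = min(newton_cotes_weights(9)) < 0   # 9 equally spaced points
all_positive = all(w > 0 for w in fejer_weights(20))
```

With 9 equally spaced points some Newton–Cotes weights are already negative, whereas the Fejér weights remain positive.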

It is difficult to trace the origin of Theorem 3.2.1, but in essence, Jacobi already 
was aware of it in 1826, and the idea of the proof, using division by the node 
polynomial, is his (Jacobi [1826]). There are other noteworthy applications of 
Theorem 3.2.1. One is to quadrature rules with $2n+1$ points, where $n$ of them are 
Gauss points and the remaining $n+1$ are to be selected, together with all the weights, 
so as to make the degree of exactness as large as possible. These are called 
Gauss–Kronrod formulae (cf. Ex. 19) and have found use in automatic integration routines. 
Interest has focused on the polynomial of degree $n+1$ (the Stieltjes polynomial) 
whose zeros are the $n+1$ nodes added to the Gauss nodes. In particular, this 
polynomial must be orthogonal to all polynomials of lower degree with respect to 
the “weight function” $\pi_n(t;w)w(t)$. The oscillatory character of this weight function 
poses intriguing questions regarding the location relative to the Gauss nodes, or even 
the reality, of the added nodes. For a discussion of these and related matters, see the 
surveys in Gautschi [1988] and Notaris [1994]. 

There is a theorem analogous to Theorem 3.2.1 that deals with quadrature 
rules having multiple nodes. The simplest one, first studied by Turán [1950], 
has constant multiplicity $2s+1$ ($s \ge 0$) for each node; that is, on the right 
of (3.35), there are also terms involving derivatives up to order $2s$ for each 
node $t_k$ (cf. Ex. 20 for $s = 1$). If one applies Gauss’s principle of maximum 
algebraic degree of exactness to them, one is led to define the $t_k$ as the zeros 
of a polynomial of degree $n$ whose $(2s+1)$st power is orthogonal to all 
lower-degree polynomials (cf. Gautschi [1981, Sect. 2.2.1]). This gives rise to what 
are called s-orthogonal polynomials and to generalizations thereof pertaining to 
multiplicities that vary from node to node; a good reference for this is Ghizzetti and 
Ossicini [1970, Chap. 3, Sect. 3.9]; see also Gautschi [1981, Sect. 2.2] and Chap. 4, 
Sect. 4.1.4. 

Another class of Gauss-type formulae, where exactness is required not only for 
polynomials (if at all), but also for rational functions (with prescribed poles), has 
been developed by Van Assche and Vanherwegen [1993] and Gautschi [1993]. They 
are particularly useful if the integrand has poles near the interval of integration. One 
can, of course, require exactness for still other systems of functions; for literature 
on this, see Gautschi [1981, Sect. 2.3.3]. Exactness of Gauss formulae for parabolic 
splines is discussed in Nikolov and Simian [2011]. 


Section 3.2.3. (c) The convergence theory for Gauss formulae on infinite 
intervals is more delicate. Some general results can be found in the book by 
Freud [1971, Chap. 3, Sect. 3.1]. 

(e) The importance for Gauss quadrature rules of eigenvalues and eigenvectors 
of the Jacobi matrix and related computational algorithms was first elaborated by 
Golub and Welsch [1969], although the idea is older. Similar eigenvalue techniques 
apply to Gauss–Radau and Gauss–Lobatto formulae (Golub [1973]), and indeed to 
Gauss–Kronrod formulae as well (Laurie [1997]). 

A list of published tables of Gauss formulae for various classical and nonclassical 
weight functions is contained in Gautschi [1981, Sect. 5.4], where one also finds 
a detailed history of Gauss—Christoffel quadrature rules and extensions thereof. 
A table of recurrence coefficients, in particular, also for Jacobi weight functions, 
can be found in the Appendix to Chihara [1978] and in Gautschi [2004, Sect. 1.5]. 
For practical purposes, it is important to be able to automatically generate Gauss 
formulae as needed, even if the Jacobi matrix for them is unknown (and must itself 
be computed). A major first step in this direction is the Fortran computer package 
in Gautschi [1994b] based on earlier work of the author in Gautschi [1982], and 
the more recent Matlab packages OPQ, SOPQ on the Web at the URL cited in 
Gautschi [2004, Preface]. 
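
The eigenvalue connection mentioned under (e) can be sketched as follows. This is not the Golub–Welsch algorithm itself (which uses a QR-type iteration) nor the code of the packages cited; as a simpler stand-in, the eigenvalues of the Jacobi matrix are located by Sturm-sequence bisection, and each weight is obtained as the reciprocal of $\sum_{j=0}^{n-1} \tilde\pi_j^2(t_k)$, with $\tilde\pi_j$ the orthonormal polynomials. The function name `gauss_from_recurrence` and the convention that `beta[0]` holds the integral of the weight function are assumptions of this sketch.

```python
import math

def gauss_from_recurrence(alpha, beta):
    """Nodes and weights of the n-point Gauss rule from recurrence coefficients.

    alpha[k], beta[k], k = 0..n-1, with beta[0] = integral of the weight function.
    """
    n = len(alpha)
    off = [math.sqrt(beta[k]) for k in range(1, n)]   # off-diagonal of the Jacobi matrix

    def count_below(x):
        # Sturm count: number of eigenvalues of the Jacobi matrix less than x.
        count, q = 0, 1.0
        for k in range(n):
            q = alpha[k] - x - (beta[k] / q if k > 0 else 0.0)
            if q == 0.0:
                q = 1e-300
            if q < 0.0:
                count += 1
        return count

    # Gershgorin interval containing all eigenvalues.
    radius = [(off[k - 1] if k > 0 else 0.0) + (off[k] if k < n - 1 else 0.0)
              for k in range(n)]
    lo = min(alpha[k] - radius[k] for k in range(n))
    hi = max(alpha[k] + radius[k] for k in range(n))

    def weight(t):
        # Evaluate orthonormal polynomials at t by the three-term recurrence.
        p_prev, p = 0.0, 1.0 / math.sqrt(beta[0])
        s = p * p
        for j in range(n - 1):
            b_j = math.sqrt(beta[j]) if j > 0 else 0.0
            p_prev, p = p, ((t - alpha[j]) * p - b_j * p_prev) / math.sqrt(beta[j + 1])
            s += p * p
        return 1.0 / s

    nodes = []
    for k in range(n):
        a, b = lo, hi
        for _ in range(100):            # bisection for the (k+1)st smallest eigenvalue
            mid = 0.5 * (a + b)
            if count_below(mid) >= k + 1:
                b = mid
            else:
                a = mid
        nodes.append(0.5 * (a + b))
    return nodes, [weight(t) for t in nodes]

# Legendre weight on [-1, 1]: alpha_k = 0, beta_0 = 2, beta_k = k^2/(4k^2 - 1).
n = 3
x, w = gauss_from_recurrence([0.0] * n,
                             [2.0] + [k * k / (4.0 * k * k - 1.0) for k in range(1, n)])
```

For $n = 3$ this reproduces the familiar nodes $0, \pm\sqrt{3/5}$ and weights $8/9$, $5/9$.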


Section 3.2.4. Other applications of classical Gaussian quadrature rules, notably to 
product integration of multiple integrals, are described in the book by Stroud and 
Secrest [1966], which also contains extensive high-precision tables of Gauss formu- 
lae. Prominent use of Gaussian quadrature, especially with Jacobi weight functions, 
is made in the evaluation of Cauchy principal value integrals in connection with 
singular integral equations; for this, see, for example, Gautschi [1981, Sect. 3.2]. 
A number of problems in approximation theory that can be solved by nonclassical 
Gaussian quadrature rules are discussed in Gautschi [1996]. 


Section 3.2.7. A classical account of Romberg integration — one that made this 
procedure popular — is Bauer et al. [1963]. The basic idea, however, can be traced 
back to nineteenth-century mathematics, and even beyond. An extensive survey not 
only of the history, but also of the applications and modifications of extrapolation 
methods, is given in Joyce [1971] and supplemented in Rabinowitz [1992]. See 
also Engels [1979] and Dutka [1984] for additional historical accounts. Romberg 
schemes for other sequences of composite trapezoidal rules are discussed in 
Fischer [2002]. 
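
For concreteness, here is a minimal sketch of the classical Romberg scheme (step halving combined with Richardson extrapolation of the composite trapezoidal rule). The function name `romberg` and the layout of the returned tableau are assumptions of this illustration.

```python
import math

def romberg(f, a, b, levels=6):
    """Romberg tableau T; T[k][0] is the trapezoidal rule with 2**k panels,
    T[k][m] the result after m extrapolation steps."""
    T = []
    h = b - a
    trap = 0.5 * h * (f(a) + f(b))
    T.append([trap])
    n = 1                                   # current number of panels
    for k in range(1, levels):
        # Halve the step: reuse the old sum and add only the new midpoints.
        h *= 0.5
        mids = sum(f(a + (2 * i + 1) * h) for i in range(n))
        trap = 0.5 * trap + h * mids
        n *= 2
        row = [trap]
        for m in range(1, k + 1):
            factor = 4.0 ** m
            row.append((factor * row[m - 1] - T[k - 1][m - 1]) / (factor - 1.0))
        T.append(row)
    return T

tableau = romberg(math.exp, 0.0, 1.0)
best = tableau[-1][-1]                      # approximates e - 1
```

The second column of the tableau coincides with the composite Simpson rule, which gives a quick consistency check.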

Richardson extrapolation is just one of many techniques to accelerate the conver- 
gence of sequences. For others, we refer to the books by Wimp [1981], Brezinski 
and Redivo-Zaglia [1991] (containing also computer programs), and Sidi [2003]. 
A book with emphasis on linear extrapolation methods and the existence of related 
asymptotic expansions is Walz [1996]. 

Exercises and Machine Assignments to Chapter 3 


Exercises 


1. From (3.4)–(3.6) with $n = 3$ we know that 

$$f'(x_0) = [x_0,x_1]f + (x_0-x_1)[x_0,x_1,x_2]f + (x_0-x_1)(x_0-x_2)[x_0,x_1,x_2,x_3]f + e_3,$$

where $e_3 = (x_0-x_1)(x_0-x_2)(x_0-x_3)\,\dfrac{f^{(4)}(\xi)}{4!}$. Apply this to 

$$x_0 = 0,\quad x_1 = \tfrac{1}{8},\quad x_2 = \tfrac{1}{4},\quad x_3 = \tfrac{1}{2}$$

and express $f'(0)$ as a linear combination of $f_k = f(x_k)$, $k = 0,1,2,3$, and 
$e_3$. Also, estimate the error $e_3$ in terms of $M_4 = \max_{0\le t\le 1/2} |f^{(4)}(t)|$. 

2. Derive a formula for the error term $r_n'(x)$ of numerical differentiation analogous 
to (3.5) but for $x \ne x_0$. {Hint: use Chap. 2, (2.116) in combination with Chap. 2, 
Ex. 58.} 

3. Let $x_i$, $i = 0,1,\dots,n$, be $n+1$ distinct points with $H = \max_{1\le i\le n} |x_i - x_0|$ 
small. 

(a) Show that for $k = 0,1,\dots,n$ one has 

$$\left.\frac{d^k}{dx^k}\prod_{i=1}^{n}(x - x_i)\right|_{x=x_0} = O(H^{n-k}) \quad\text{as } H \to 0.$$

(b) Prove that 

$$f^{(n)}(x_0) = n!\,[x_0,x_1,\dots,x_n]f + e_n,$$

where 

$$e_n = \begin{cases} O(H^2) & \text{if } x_0 = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} x_i, \\[1ex] O(H) & \text{otherwise,} \end{cases}$$

assuming that $f$ is sufficiently often (how often?) differentiable in the 
interval spanned by the $x_i$. {Hint: use the Newton interpolation formula 
with remainder, in combination with Leibniz’s rule of differentiation.} 

(c) Specialize the formula in (b) to equally spaced points $x_i$ with spacing $h$ 
and express the result in terms of either the $n$th forward difference $\Delta^n f_0$ 
or the $n$th backward difference $\nabla^n f_n$ of the values $f_i = f(x_i)$. {Here, 
$\Delta f_0 = f_1 - f_0$, $\Delta^2 f_0 = \Delta(\Delta f_0) = \Delta f_1 - \Delta f_0 = f_2 - 2f_1 + f_0$, etc., 
and similarly for $\nabla f_1$, $\nabla^2 f_2$, and so on.} 

4. Approximate $\partial u/\partial x\,|_{P_1}$ in terms of $u_0 = u(P_0)$, $u_1 = u(P_1)$, $u_2 = u(P_2)$ (see 
figure, where the curve represents a quarter arc of the unit circle). Estimate the 
error. 


5. (a) Use the central difference quotient approximation $f'(x) \approx [f(x+h) - f(x-h)]/(2h)$ 
of the first derivative to obtain an approximation of 
$\dfrac{\partial^2 u}{\partial x\,\partial y}(x,y)$ for a function $u$ of two variables. 

(b) Use Taylor expansion of a function of two variables to show that the error 
of the approximation derived in (a) is $O(h^2)$. 

6. Consider the integral $I = \int_{-1}^{1} |x|\,dx$, whose exact value is evidently 1. Suppose 
$I$ is approximated (as it stands) by the composite trapezoidal rule $T(h)$ with 
$h = 2/n$, $n = 1,2,3,\dots$. 

(a) Show (without any computation) that $T(2/n) = 1$ if $n$ is even. 
(b) Determine $T(2/n)$ for $n$ odd and comment on the speed of convergence. 


7. Let 

$$I(h) = \int_0^h f(x)\,dx, \qquad T(h) = \frac{h}{2}\,[f(0) + f(h)].$$

(a) Evaluate $I(h)$, $T(h)$, and $E(h) = I(h) - T(h)$ explicitly for $f(x) = x^2 + x^{5/2}$. 

(b) Repeat for $f(x) = x^2 + x^{1/2}$. Explain the discrepancy that you will observe 
in the order of the error terms. 

8. (a) Derive the “midpoint rule” of integration 

$$\int_{x_k}^{x_k+h} f(x)\,dx = h f\!\left(x_k + \tfrac{1}{2}h\right) + \frac{h^3}{24} f''(\xi), \qquad x_k < \xi < x_k + h.$$

{Hint: use Taylor’s theorem centered at $x_k + \tfrac{1}{2}h$.} 

(b) Obtain the composite midpoint rule for $\int_a^b f(x)\,dx$, including the error 
term, subdividing $[a,b]$ into $n$ subintervals of length $h = \frac{b-a}{n}$. 

9. (a) Show that the elementary Simpson’s rule can be obtained as follows: 

$$\int_{-1}^{1} y(t)\,dt = \int_{-1}^{1} p_3(y; -1, 0, 0, 1; t)\,dt + E^S(y).$$

(b) Obtain a formula for the remainder $E^S(y)$, assuming $y \in C^4[-1,1]$. 
(c) Using (a) and (b), derive the composite Simpson’s rule for $\int_a^b f(x)\,dx$, 
including the remainder term. 

10. Let $E_n^S(f)$ be the remainder term of the composite Simpson’s rule for 
$\int_0^{2\pi} f(x)\,dx$ using $n$ subintervals ($n$ even). Evaluate $E_n^S(f)$ for $f(x) = e^{imx}$ 
($m = 0,1,\dots$). Hence determine for what values of $d$ Simpson’s rule 
integrates exactly (on $[0,2\pi]$) trigonometric polynomials of degree $d$. 

11. Estimate the number of subintervals required to obtain $\int_0^1 e^{-x^2}\,dx$ to 6 correct 
decimal places (absolute error $\le \frac{1}{2}\times 10^{-6}$) 

(a) by means of the composite trapezoidal rule, 
(b) by means of the composite Simpson’s rule. 

12. Let $f$ be an arbitrary (continuous) function on $[0,1]$ satisfying 

$$f(x) + f(1-x) \equiv 1 \quad\text{for } 0 \le x \le 1.$$

(a) Show that $\int_0^1 f(x)\,dx = \frac{1}{2}$. 
(b) Show that the composite trapezoidal rule for computing $\int_0^1 f(x)\,dx$ is 
exact. 
(c) Show, with as little computation as possible, that the composite Simpson’s 
rule and more general symmetric rules are also exact. 

13. (a) Construct a trapezoidal-like formula 

$$\int_0^h f(x)\,dx = a f(0) + b f(h) + E(f), \qquad 0 < h < \pi,$$

which is exact for $f(x) = \cos x$ and $f(x) = \sin x$. Does this formula 
integrate constants exactly? 
(b) Show that a similar formula holds for $\int_c^{c+h} g(t)\,dt$. 




14. Given the subdivision $\Delta$ of $[0,2\pi]$ into $N$ equal subintervals, 
$0 = x_0 < x_1 < x_2 < \cdots < x_{N-1} < x_N = 2\pi$, $x_k = kh$, $h = 2\pi/N$, and a 
$(2\pi)$-periodic function $f$, construct a quadrature rule for the $m$th (complex) Fourier 
coefficients of $f$, 

$$\frac{1}{2\pi}\int_0^{2\pi} f(x)\,e^{-imx}\,dx,$$

by approximating $f$ by the spline interpolant $s_1(f;\cdot)$ from $S_1^0(\Delta)$. Write the 
result in the form of a “modified” composite trapezoidal approximation. {Hint: 
express $s_1(f;\cdot)$ in terms of the hat functions defined in Chap. 2, (2.129).} 


15. The composite trapezoidal rule for computing $\int_0^1 f(x)\,dx$ can be generalized to 
subdivisions 

$$\Delta:\quad 0 = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = 1$$

of the interval $[0,1]$ in subintervals of arbitrary length $\Delta x_i = x_{i+1} - x_i$, 
$i = 0,1,\dots,n-1$, by approximating 

$$\int_0^1 f(x)\,dx \approx \int_0^1 s_1(f;x)\,dx,$$

where $s_1(f;\cdot) \in S_1^0(\Delta)$ is the piecewise linear continuous spline interpolating 
$f$ at $x_0, x_1, \dots, x_n$. 

(a) Use the basis of hat functions $B_0, B_1, \dots, B_n$ to represent $s_1(f;\cdot)$ and 
calculate $\int_0^1 s_1(f;x)\,dx$. 

(b) Discuss the error $E(f) = \int_0^1 f(x)\,dx - \int_0^1 s_1(f;x)\,dx$. In particular, find a 
formula of the type $E(f) = \text{const}\cdot f''(\xi)$, $0 < \xi < 1$, where the constant 
depends only on $\Delta$. 


16. (a) Construct the weighted Newton–Cotes formula 

$$\int_0^1 f(x)\,x^{\alpha}\,dx = a_0 f(0) + a_1 f(1) + E(f), \qquad \alpha > -1.$$

Explain why the formula obtained makes good sense. 

(b) Derive an expression for the error term $E(f)$ in terms of an appropriate 
derivative of $f$. 

(c) From the formulae in (a) and (b) derive an approximate integration formula 
for $\int_0^h g(t)\,t^{\alpha}\,dt$ ($h > 0$ small), including an expression for the error term. 

17. (a) Construct the weighted Newton–Cotes formula 

$$\int_0^1 f(x)\cdot x\ln(1/x)\,dx \approx a_0 f(0) + a_1 f(1).$$

{Hint: use $\int_0^1 x^r \ln(1/x)\,dx = (r+1)^{-2}$, $r = 0,1,2,\dots$.} 

(b) Discuss how the formula in (a) can be used to approximate 
$\int_0^h g(t)\cdot t\ln(1/t)\,dt$ for small $h > 0$. {Hint: make a change of variables.} 


18. Let $s$ be the function defined by 

$$s(x) = \begin{cases} (x+1)^3 & \text{if } -1 \le x \le 0, \\ (1-x)^3 & \text{if } 0 \le x \le 1. \end{cases}$$

(a) With $\Delta$ denoting the subdivision of $[-1,1]$ into the two subintervals $[-1,0]$ 
and $[0,1]$, to what class $S_m^k(\Delta)$ does the spline $s$ belong? 

(b) Estimate the error of the composite trapezoidal rule applied to $\int_{-1}^{1} s(x)\,dx$, 
when $[-1,1]$ is divided into $n$ subintervals of equal length $h = 2/n$ and $n$ 
is even. 

(c) What is the error of the composite Simpson’s rule applied to $\int_{-1}^{1} s(x)\,dx$, 
with the same subdivision of $[-1,1]$ as in (b)? 

(d) What is the error resulting from applying the 2-point Gauss–Legendre rule 
to $\int_{-1}^{0} s(x)\,dx$ and $\int_0^1 s(x)\,dx$ separately and summing? 


19. (Gauss–Kronrod rule) Let $\pi_n(\cdot\,;w)$ be the (monic) orthogonal polynomial of 
degree $n$ relative to a nonnegative weight function $w$ on $[a,b]$, and $t_k^{(n)}$ its zeros. 
Use Theorem 3.2.1 to determine conditions on $w_k$, $w_k^*$, $t_k^*$ for the quadrature rule 

$$\int_a^b f(t)w(t)\,dt = \sum_{k=1}^{n} w_k f(t_k^{(n)}) + \sum_{k=1}^{n+1} w_k^* f(t_k^*) + E_n(f)$$

to have degree of exactness at least $3n+1$; that is, $E_n(f) = 0$ for all $f \in \mathbb{P}_{3n+1}$. 

20. (Turán quadrature formula) Let $w$ be a nonnegative weight function on $[a,b]$. 
Prove: the quadrature formula 

$$\int_a^b f(t)w(t)\,dt = \sum_{k=1}^{n} \left[ w_k f(t_k) + w_k' f'(t_k) + w_k'' f''(t_k) \right] + E_n(f)$$

has degree of exactness $d = 4n-1$ if and only if the following conditions are 
satisfied: 
(a) The formula is (Hermite-) interpolatory; that is, $E_n(f) = 0$ if $f \in \mathbb{P}_{3n-1}$. 
(b) The node polynomial $\omega_n(t) = \prod_{k=1}^{n}(t - t_k)$ satisfies 

$$\int_a^b [\omega_n(t)]^3\, p(t)\,w(t)\,dt = 0 \quad\text{for all } p \in \mathbb{P}_{n-1}.$$

{Hint: simulate the proof of Theorem 3.2.1.} 

21. Consider $s > 1$ weight functions $w_\sigma(t)$, $\sigma = 1,2,\dots,s$, integers $m_\sigma$ such that 
$\sum_{\sigma=1}^{s} m_\sigma = n$, and $s$ quadrature rules 

$$Q_\sigma:\quad \int_a^b f(t)\,w_\sigma(t)\,dt = \sum_{k=1}^{n} w_{k,\sigma} f(t_k) + E_{n,\sigma}(f), \qquad \sigma = 1,2,\dots,s,$$

which share $n$ common nodes $t_k$ but have individual weights $w_{k,\sigma}$. State 
necessary and sufficient conditions for $Q_\sigma$ to have degree of exactness 
$n + m_\sigma - 1$, $\sigma = 1,2,\dots,s$, and explain why this is likely to be optimal. 

22. Consider a quadrature formula of the form 

$$\int_0^1 f(x)\,dx = a_0 f(0) + a_1 f'(0) + \sum_{k=1}^{n} w_k f(x_k) + b_0 f(1).$$

(a) Call the formula “Hermite-interpolatory” if the right-hand side is obtained 
by integrating on the left instead of $f$ the (Hermite) interpolation polynomial 
$p$ satisfying 

$$p(0) = f(0),\quad p'(0) = f'(0),\quad p(1) = f(1),\quad p(x_k) = f(x_k),\ k = 1,2,\dots,n.$$

What degree of exactness does the formula have in this case (regardless 
of how the nodes $x_k$ are chosen, as long as they are mutually distinct and 
strictly inside the interval $[0,1]$)? 
(b) What is the maximum degree of exactness expected to be if all coefficients 
and nodes $x_k$ are allowed to be freely chosen? 
(c) Show that for the maximum degree of exactness to be achieved, it is 
necessary that $\{x_k\}$ are the zeros of the polynomial $\pi_n$ of degree $n$ which is 
orthogonal on $[0,1]$ with respect to the weight function $w(x) = x^2(1-x)$. 
Identify this polynomial in terms of one of the classical orthogonal 
polynomials. 
(d) Show that the choice of the $x_k$ in (c) together with the requirement of 
the quadrature formula to be Hermite-interpolatory is sufficient for the 
maximum degree of exactness to be attained. 

23. Show that the Gauss–Radau as well as the Gauss–Lobatto formulae are positive 
if the weight function $w$ is nonnegative and not identically zero. {Hint: 
modify the proof given for the Gauss formula in Sect. 3.2.3(b).} What are the 
implications with regard to convergence as $n \to \infty$ of the formulae? 

24. (Fejér, 1933). Let $t_k$, $k = 1,2,\dots,n$, be the zeros of 

$$\omega_n(t) = P_n(t) + \alpha P_{n-1}(t) + \beta P_{n-2}(t), \qquad n \ge 2,$$

where $\{P_k\}$ are the Legendre polynomials, and assume $\alpha \in \mathbb{R}$, $\beta \le 0$, and the 
zeros $t_k$ real and pairwise distinct. Show that the Newton–Cotes formula 

$$\int_{-1}^{1} f(t)\,dt = \sum_{k=1}^{n} w_k f(t_k) + E_n(f), \qquad E_n(\mathbb{P}_{n-1}) = 0,$$

has all weights positive: $w_k > 0$ for $k = 1,2,\dots,n$. {Hint: define 
$\Delta_k(t) = [\ell_k(t)]^2 - \ell_k(t)$ and show that $\int_{-1}^{1} \Delta_k(t)\,dt \le 0$.} 

25. (a) Determine by Newton’s interpolation formula the quadratic polynomial $p$ 
interpolating $f$ at $x = 0$ and $x = 1$ and $f'$ at $x = 0$. Also, express the 
error in terms of an appropriate derivative (assumed continuous on $[0,1]$). 

(b) Based on the result of (a), derive an integration formula of the type 

$$\int_0^1 f(x)\,dx = a_0 f(0) + a_1 f(1) + b_0 f'(0) + E(f).$$

Determine $a_0$, $a_1$, $b_0$ and an appropriate expression for $E(f)$. 
(c) Transform the result of (b) to obtain an integration rule, with remainder, for 
$\int_c^{c+h} y(t)\,dt$, where $h > 0$. {Do not rederive this rule from scratch.} 

26. Imitate the procedures used in the Example of Chap. 2, Sect. 2.1.4(2) for monic 
Legendre polynomials to show orthogonality on $[0,\infty)$ relative to the Laguerre 
measure $d\lambda(t) = t^{\alpha}e^{-t}\,dt$, $\alpha > -1$, of the (monic) polynomials 

$$\pi_k(t) = (-1)^k\, t^{-\alpha} e^{t}\, \frac{d^k}{dt^k}\left(t^{\alpha+k} e^{-t}\right), \qquad k = 0,1,2,\dots,$$

and to derive explicit formulae for the recursion coefficients $\alpha_k$, $\beta_k$. {Hint: 
express $\alpha_k$ and $\beta_k$ in terms of the coefficients $\lambda_k$, $\mu_k$ in 
$\pi_k(t) = t^k + \lambda_k t^{k-1} + \mu_k t^{k-2} + \cdots$.} 

27. Show that 

$$\pi_k(t) = \frac{(-1)^k}{2^k}\, e^{t^2}\, \frac{d^k}{dt^k}\left(e^{-t^2}\right), \qquad k = 0,1,2,\dots,$$

are the monic orthogonal polynomials on $\mathbb{R}$ relative to the Hermite measure 
$d\lambda(t) = e^{-t^2}\,dt$. Use this “Rodrigues formula” directly to derive the recurrence 
relation for the (monic) Hermite polynomials. 

28. (a) Construct the quadratic (monic) polynomial $\pi_2(\cdot\,;w)$ orthogonal on $(0,\infty)$ 
with respect to the weight function $w(t) = e^{-t}$. {Hint: use $\int_0^\infty t^m e^{-t}\,dt = m!$.} 

(b) Obtain the two-point Gauss–Laguerre quadrature formula, 

$$\int_0^\infty f(t)\,e^{-t}\,dt = w_1 f(t_1) + w_2 f(t_2) + E_2(f),$$

including a representation for the remainder $E_2(f)$. 

(c) Apply the formula in (b) to approximate $I = \int_0^\infty e^{-t}\,dt/(t+1)$. Use the 
remainder term $E_2(f)$ to estimate the error, and compare your estimate 
with the true error {use $I = 0.596347361\ldots$}. Knowing the true error, 
identify the unknown quantity $\xi > 0$ contained in the error term $E_2(f)$. 

29. Derive the 2-point Gauss–Hermite quadrature formula, 

$$\int_{-\infty}^{\infty} f(t)\,e^{-t^2}\,dt = w_1 f(t_1) + w_2 f(t_2) + E_2(f),$$

including an expression for the remainder $E_2(f)$. {Hint: use 
$\int_0^\infty t^{2n} e^{-t^2}\,dt = \dfrac{(2n)!}{n!\,2^{2n}}\,\dfrac{\sqrt{\pi}}{2}$, $n = 0,1,2,\dots$.} 

30. Let $\pi_n(\cdot\,;w)$ be the $n$th-degree orthogonal polynomial with respect to the 
weight function $w$ on $[a,b]$, $t_1, t_2, \dots, t_n$ its $n$ zeros, and $w_1, w_2, \dots, w_n$ the 
$n$ Gauss weights. 

(a) Assuming $n > 1$, show that the $n$ polynomials $\pi_0, \pi_1, \dots, \pi_{n-1}$ are 
also orthogonal with respect to the discrete inner product 
$(u,v) = \sum_{\nu=1}^{n} w_\nu u(t_\nu) v(t_\nu)$. 

(b) With $\ell_i(t) = \prod_{k\ne i} [(t - t_k)/(t_i - t_k)]$, $i = 1,2,\dots,n$, denoting the 
elementary Lagrange interpolation polynomials associated with the nodes 
$t_1, t_2, \dots, t_n$, show that 

$$\int_a^b \ell_i(t)\,\ell_k(t)\,w(t)\,dt = 0 \quad\text{if } i \ne k.$$

31. Consider a quadrature formula of the type 

$$\int_0^\infty e^{-x} f(x)\,dx = a f(0) + b f(c) + E(f).$$

(a) Find $a$, $b$, $c$ such that the formula has degree of exactness $d = 2$. Can 
you identify the formula so obtained? {Point of information: $\int_0^\infty e^{-x}x^r\,dx = r!$} 

(b) Let $p_2(x) = p_2(f; 0, 2, 2; x)$ be the Hermite interpolation polynomial 
interpolating $f$ at the (simple) point $x = 0$ and the double point $x = 2$. 
Determine $\int_0^\infty e^{-x} p_2(x)\,dx$ and compare with the result in (a). 

(c) Obtain the remainder $E(f)$ in the form $E(f) = \text{const}\cdot f'''(\xi)$, $\xi > 0$. 

32. In this problem, $\pi_j(\cdot\,;w)$ denotes the monic polynomial of degree $j$ orthogonal 
on the interval $[a,b]$ relative to a weight function $w \ge 0$. 

(a) Show that $\pi_n(\cdot\,;w)$, $n > 0$, has at least one real zero in the interior of $[a,b]$ 
at which $\pi_n$ changes sign. 

(b) Prove that all zeros of $\pi_n(\cdot\,;w)$ are real, simple, and contained in the 
interior of $[a,b]$. {Hint: put $r_0 = \max\{r \ge 1 : t_{k_1}^{(n)}, t_{k_2}^{(n)}, \dots, t_{k_r}^{(n)}$ are distinct 
real zeros of $\pi_n$ in $(a,b)$ at each of which $\pi_n$ changes sign$\}$. Show that 
$r_0 = n$.} 

33. Prove that the zeros of $\pi_n(\cdot\,;w)$ interlace with those of $\pi_{n+1}(\cdot\,;w)$. 

34. Consider the Hermite interpolation problem: Find $p \in \mathbb{P}_{2n-1}$ such that 

$$(*)\qquad p(\tau_\nu) = f_\nu,\quad p'(\tau_\nu) = f_\nu',\quad \nu = 1,2,\dots,n.$$

There are “elementary Hermite interpolation polynomials” $h_\nu$, $k_\nu$ such that the 
solution of (*) can be expressed (in analogy to Lagrange’s formula) in the form 

$$p(t) = \sum_{\nu=1}^{n} \left[ h_\nu(t) f_\nu + k_\nu(t) f_\nu' \right].$$

(a) Seek $h_\nu$ and $k_\nu$ in the form 

$$h_\nu(t) = (a_\nu + b_\nu t)\,\ell_\nu^2(t), \qquad k_\nu(t) = (c_\nu + d_\nu t)\,\ell_\nu^2(t),$$

where $\ell_\nu$ are the elementary Lagrange polynomials. Determine the 
constants $a_\nu$, $b_\nu$, $c_\nu$, $d_\nu$. 
(b) Obtain the quadrature rule 

$$\int_a^b f(t)w(t)\,dt = \sum_{\nu=1}^{n} \left[ \lambda_\nu f(\tau_\nu) + \mu_\nu f'(\tau_\nu) \right] + E_n(f)$$

with the property that $E_n(f) = 0$ for all $f \in \mathbb{P}_{2n-1}$. 
(c) What conditions on the node polynomial $\omega_n(t) = \prod_{\nu=1}^{n}(t - \tau_\nu)$ (or on the 
nodes $\tau_\nu$) must be imposed in order that $\mu_\nu = 0$ for $\nu = 1,2,\dots,n$? 


35. Show that $\int_0^1 (1-t)^{-1/2} f(t)\,dt$, when $f$ is smooth, can be computed accurately 
by Gauss–Legendre quadrature. {Hint: substitute $1 - t = x^2$.} 

36. The Gaussian quadrature rule for the (Chebyshev) weight function 
$w(t) = (1-t^2)^{-1/2}$ is known to be 

$$\int_{-1}^{1} f(t)\,(1-t^2)^{-1/2}\,dt \approx \frac{\pi}{n}\sum_{k=1}^{n} f(t_k^C), \qquad t_k^C = \cos\left(\frac{2k-1}{2n}\,\pi\right).$$

(The nodes $t_k^C$ are the $n$ Chebyshev points.) Use this fact to show that the unit 
disk has area $\pi$. 

37. Assuming $f$ is a well-behaved function, discuss how the following integrals can 
be approximated by standard Gauss-type rules (i.e., with canonical intervals 
and weight functions). 

(a) $\int_a^b f(x)\,dx$ ($a < b$). 
(b) $\int_1^\infty e^{-ax} f(x)\,dx$ ($a > 0$). 
(c) $\int_{-\infty}^{\infty} e^{-(ax^2+bx)} f(x)\,dx$ ($a > 0$). {Hint: complete the square.} 
(d) $\int_0^\infty \frac{e^{-xt}}{y+t}\,dt$, $x > 0$, $y > 0$. Is the approximation you get for the integral 
too small or too large? Explain. 


38. (a) Let $w(t)$ be an even weight function on $[a,b]$, $a < b$, $a + b = 0$, i.e., 
$w(-t) = w(t)$ on $[a,b]$. Show that $(-1)^n \pi_n(-t;w) = \pi_n(t;w)$, i.e., the 
(monic) $n$th-degree orthogonal polynomial relative to the weight function 
$w$ is even [odd] for $n$ even [odd]. 

(b) Show that the Gauss formula 

$$\int_a^b f(t)w(t)\,dt = \sum_{\nu=1}^{n} w_\nu f(t_\nu) + E_n(f)$$

for an even weight function $w$ is symmetric, i.e., 

$$t_{n+1-\nu} = -t_\nu, \qquad w_{n+1-\nu} = w_\nu, \qquad \nu = 1,2,\dots,n.$$

(c) Let $w$ be the “hat function” 

$$w(t) = \begin{cases} 1 + t & \text{if } -1 \le t \le 0, \\ 1 - t & \text{if } 0 \le t \le 1. \end{cases}$$

Obtain the 2-point Gaussian quadrature formula $\int_{-1}^{1} f(t)w(t)\,dt = 
w_1 f(t_1) + w_2 f(t_2) + E_2(f)$ for this weight function $w$, including an 
expression for the error term under a suitable regularity assumption on $f$. 
{Hint: use (a) and (b) to simplify the calculations.} 

39. Let $t_k^{(2n)}$ be the nodes, ordered monotonically, of the $(2n)$-point Gauss–Legendre 
quadrature rule and $w_k^{(2n)}$ the associated weights. Show that, for any 
$p \in \mathbb{P}_{2n-1}$, one has 

$$\int_0^1 t^{-1/2}\,p(t)\,dt = 2\sum_{k=1}^{n} w_k^{(2n)}\, p\!\left([t_k^{(2n)}]^2\right).$$


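40. Let $f$ be a smooth function on $[0,\pi]$. Explain how best to evaluate 

$$I_{\alpha,\beta}(f) = \int_0^{\pi} f(\theta)\left[\cos\tfrac{1}{2}\theta\right]^{\alpha}\left[\sin\tfrac{1}{2}\theta\right]^{\beta} d\theta, \qquad \alpha > -1,\ \beta > -1.$$

41. Let $Q_n f$, $Q_{n^*} f$ be $n$-point, resp. $n^*$-point quadrature rules for 
$If = \int_a^b f(t)w(t)\,dt$ and $Q_{n^*} f$ at least twice as accurate as $Q_n f$, i.e., 

$$|Q_{n^*} f - If| \le \tfrac{1}{2}\,|Q_n f - If|.$$

Show that the error of $Q_{n^*} f$ then satisfies 

$$|Q_{n^*} f - If| \le |Q_n f - Q_{n^*} f|.$$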


42. Given a nonnegative weight function $w$ on $[-1,1]$ and $x > 1$, let 

$$(G)\qquad \int_{-1}^{1} f(t)\,\frac{w(t)}{x^2 - t^2}\,dt = \sum_{k=1}^{n} w_k^G f(t_k^G) + E_n^G(f)$$

be the $n$-point Gaussian quadrature formula for the weight function $\frac{w(t)}{x^2 - t^2}$. (Note 
that $t_k^G$, $w_k^G$ both depend on $n$ and $x$.) Consider the quadrature rule 

$$\int_{-1}^{1} g(t)w(t)\,dt = \sum_{k=1}^{n} w_k\, g(t_k) + E_n(g),$$

where $t_k = t_k^G$, $w_k = [x^2 - (t_k^G)^2]\,w_k^G$. Prove: 

(a) $E_n(g) = 0$ if $g(t) = \dfrac{1}{t \pm x}$. 

(b) If $n \ge 2$, then $E_n(g) = 0$ whenever $g$ is a polynomial of degree $\le 2n - 3$. 


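43. Let $\xi_\nu$, $\nu = 1,2,\dots,2n$, be $2n$ preassigned distinct numbers satisfying 
$-1 < \xi_\nu < 1$, and let $w$ be a positive weight function on $[-1,1]$. Define 
$\omega_{2n}(x) = \prod_{\nu=1}^{2n}(1 + \xi_\nu x)$. (Note that $\omega_{2n}$ is positive on $[-1,1]$.) Let $x_k^G$, $w_k^G$ 
be the nodes and weights of the $n$-point Gauss formula for the weight function 
$w^*(x) = \dfrac{w(x)}{\omega_{2n}(x)}$: 

$$\int_{-1}^{1} p(x)\,w^*(x)\,dx = \sum_{k=1}^{n} w_k^G\, p(x_k^G), \qquad p \in \mathbb{P}_{2n-1}.$$

Define $x_k^* = x_k^G$, $w_k^* = w_k^G\,\omega_{2n}(x_k^G)$. Show that the quadrature formula 

$$\int_{-1}^{1} f(x)w(x)\,dx = \sum_{k=1}^{n} w_k^* f(x_k^*) + E_n^*(f)$$

is exact for the $2n$ rational functions 

$$f(x) = \frac{1}{1 + \xi_\nu x}, \qquad \nu = 1,2,\dots,2n.$$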

44. (a) Prove (3.46). 
(b) Prove (3.47). {Hint: use the Christoffel–Darboux formula of Chap. 2, 
Ex. 21(b).} 

45. Prove (3.50). {Hint: prove, more generally, $(I^{p+1}g)(s) = \int_0^s \frac{(s-t)^p}{p!}\,g(t)\,dt$.} 

46. (a) Use the method of undetermined coefficients to obtain an integration rule 
(having degree of exactness $d = 2$) of the form 

$$\int_0^1 y(s)\,ds \approx a\,y(0) + b\,y(1) - c\,[y'(1) - y'(0)].$$

(b) Transform the rule in (a) into one appropriate for approximating 
$\int_x^{x+h} f(t)\,dt$. 

(c) Obtain a composite integration rule based on the formula in (b) for 
approximating $\int_a^b f(t)\,dt$. Interpret the result. 

47. Determine the quadrature formula of the type 

$$\int_{-1}^{1} f(t)\,dt = \alpha_{-1}\int_{-1}^{-1/2} f(t)\,dt + \alpha_0 f(0) + \alpha_1 \int_{1/2}^{1} f(t)\,dt + E(f)$$

having maximum degree of exactness $d$. What is the value of $d$? 

48. (a) Determine the quadratic spline $s_2(x)$ on $[-1,1]$ with a single knot at $x = 0$ 
and such that $s_2(x) \equiv 0$ on $[-1,0]$ and $s_2(1) = 1$. 
(b) Consider a function $s(x)$ of the form 

$$s(x) = c_0 + c_1 x + c_2 x^2 + c_3 s_2(x), \qquad c_i = \text{const},$$

where $s_2(x)$ is as defined in (a). What kind of function is $s$? Determine $s$ 
such that 

$$s(-1) = f_{-1},\quad s(0) = f_0,\quad s'(0) = f_0',\quad s(1) = f_1,$$

where $f_{-1} = f(-1)$, $f_0 = f(0)$, $f_0' = f'(0)$, $f_1 = f(1)$ for some 
function $f$ on $[-1,1]$. 
(c) What quadrature rule does one obtain if one approximates $\int_{-1}^{1} f(x)\,dx$ by 
$\int_{-1}^{1} s(x)\,dx$, with $s$ as obtained in (b)? 

49. Prove that the condition (3.68) does not depend on the choice of the basis 
$\varphi_1, \varphi_2, \dots, \varphi_n$. 

50. Let $E$ be a linear functional that annihilates all polynomials of degree $d \ge 0$. 
Show that the Peano kernel $K_r(t)$, $r \le d$, of $E$ vanishes for $t \notin [a,b]$, where 
$[a,b]$ is the interval of function values referenced by $E$. 

51. Show that a linear functional $E$ satisfying $Ef = e_{r+1} f^{(r+1)}(\bar{t})$, $\bar{t} \in [a,b]$, 
$e_{r+1} \ne 0$, for any $f \in C^{r+1}[a,b]$, is necessarily definite of order $r$ if it has a 
continuous Peano kernel $K_r$. 

52. Let $E$ be a linear functional that annihilates all polynomials of degree $d$. Show 
that none of the Peano kernels $K_0, K_1, \dots, K_{d-1}$ of $E$ can be definite. 

53. Suppose in (3.61) the function $f$ is known to be only once continuously 
differentiable, i.e., $f \in C^1[0,1]$. 

(a) Derive the appropriate Peano representation of the error functional $Ef$. 
(b) Obtain an estimate of the form $|Ef| \le c_0 \|f'\|_\infty$. 


54. Assume, in Simpson’s rule 

$$\int_{-1}^{1} f(x)\,dx = \frac{1}{3}\,[f(-1) + 4f(0) + f(1)] + E^S(f),$$

that $f$ is only of class $C^2[-1,1]$ instead of class $C^4[-1,1]$ as normally 
assumed. 

(a) Find an error estimate of the type 

$$|E^S(f)| \le \text{const}\cdot\|f''\|_\infty, \qquad \|f''\|_\infty = \max_{-1\le x\le 1} |f''(x)|.$$

{Hint: apply the appropriate Peano representation of $E^S(f)$.} 
(b) Transform the result in (a) to obtain Simpson’s formula, with remainder 
estimate, for the integral 

$$\int_{c-h}^{c+h} g(t)\,dt, \qquad g \in C^2[c-h, c+h],\quad h > 0.$$

(c) How does the estimate in (a) compare with the analogous error estimate for 
two applications of the trapezoidal rule, 

$$\int_{-1}^{1} f(x)\,dx = \frac{1}{2}\,[f(-1) + 2f(0) + f(1)] + E_2^T(f)\,?$$

55. Determine the Peano kernel $K_1(t)$ on $[a,b]$ of the error functional for the 
composite trapezoidal rule over the interval $[a,b]$ subdivided into $n$ subintervals 
of equal length. 

56. Consider the trapezoidal formula “with mean values,” 

$$\int_0^1 f(x)\,dx = \frac{1}{2}\left[\frac{1}{\varepsilon}\int_0^{\varepsilon} f(x)\,dx + \frac{1}{\varepsilon}\int_{1-\varepsilon}^{1} f(x)\,dx\right] + E(f), \qquad 0 < \varepsilon < \tfrac{1}{2}.$$

(a) Determine the degree of exactness of this formula. 
(b) Express the remainder $E(f)$ by means of its Peano kernel $K_1$ in terms of 
$f''$, assuming $f \in C^2[0,1]$. 
(c) Show that the Peano kernel $K_1$ is definite, and thus express the remainder 
in the form $E(f) = e_2 f''(\tau)$, $0 < \tau < 1$. 
(d) Consider (and explain) the limit cases $\varepsilon \downarrow 0$ and $\varepsilon \to \tfrac{1}{2}$. 


57. (a) Use the method of undetermined coefficients to construct a quadrature 
formula of the type 

$$\int_0^1 f(x)\,dx = a f(0) + b f(1) + c f''(\gamma) + E(f)$$

having maximum degree of exactness $d$, the variables being $a$, $b$, $c$, and $\gamma$. 

(b) Show that the Peano kernel $K_d$ of the error functional $E$ of the formula 
obtained in (a) is definite, and hence express the remainder in the form 
$E(f) = e_{d+1} f^{(d+1)}(\xi)$, $0 < \xi < 1$. 

58. (a) Use the method of undetermined coefficients to construct a quadrature 
formula of the type 

$$\int_0^1 f(x)\,dx = -\alpha f'(0) + \beta f\!\left(\tfrac{1}{2}\right) + \alpha f'(1) + E(f)$$

that has maximum degree of exactness. 

(b) What is the precise degree of exactness of the formula obtained in (a)? 

(c) Use the Peano kernel of the error functional $E$ to express $E(f)$ in terms of 
the appropriate derivative of $f$ reflecting the result of (b). 

(d) Transform the formula in (a) to one that is appropriate to evaluate 
$\int_c^{c+h} g(t)\,dt$, and then obtain the corresponding composite formula for 
$\int_a^b g(t)\,dt$, using $n$ subintervals of equal length, and derive an error term. 
Interpret your result. 

59. Consider a quadrature rule of the form 

$$\int_0^1 x^{\alpha} f(x)\,dx \approx A f(0) + B\int_0^1 f(x)\,dx, \qquad \alpha > -1,\ \alpha \ne 0.$$

(a) Determine $A$ and $B$ such that the formula has degree of exactness $d = 1$. 

(b) Let $E(f)$ be the error functional of the rule determined in (a). Show that 
the Peano kernel $K_1(t) = E_{(x)}((x-t)_+)$ of $E$ is positive definite if $\alpha > 0$, 
and negative definite if $\alpha < 0$. 

(c) Based on the result of (b), determine the constant $e_2$ in $E(f) = e_2 f''(\xi)$, 
$0 < \xi < 1$. 


60. (a) Consider a quadrature formula of the type 

$$(*)\qquad \int_0^1 f(x)\,dx = \alpha f(x_1) + \beta\,[f(1) - f(0)] + E(f)$$

and determine $\alpha$, $\beta$, $x_1$ such that the degree of exactness is as large as 
possible. What is the maximum degree attainable? 

(b) Use interpolation theory to obtain a bound on $|E(f)|$ in terms of 
$\|f^{(r)}\|_\infty = \max_{0\le x\le 1} |f^{(r)}(x)|$ for some suitable $r$. 

(c) Adapt (*), including the bound on $|E(f)|$, to an integral of the form 
$\int_c^{c+h} f(t)\,dt$, where $c$ is some constant and $h > 0$. 

(d) Apply the result of (c) to develop a composite quadrature rule for $\int_a^b f(t)\,dt$ 
by subdividing $[a,b]$ into $n$ subintervals of equal length $h = \frac{b-a}{n}$. Find a 
bound for the total error. 

61. Construct a quadrature rule 

$$\int_0^1 x^{\alpha} f(x)\,dx \approx a_1\int_0^1 f(x)\,dx + a_2\int_0^1 x f(x)\,dx, \qquad 0 < \alpha < 1,$$

(a) which is exact for all polynomials $p$ of degree $\le 1$; 
(b) which is exact for all $f(x) = x^{1/2} p(x)$, $p \in \mathbb{P}_1$. 

62. Let 

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b, \qquad x_k = a + kh, \quad h = \frac{b-a}{n},$$

be a subdivision of $[a,b]$ into $n$ equal subintervals. 

(a) Derive an elementary quadrature formula for the integral $\int_{x_k}^{x_{k+1}} f(x)\,dx$, 
including a remainder term, by approximating $f$ by the cubic Hermite 
interpolation polynomial $p_3(f; x_k, x_k, x_{k+1}, x_{k+1}; x)$ and then integrating 
over $[x_k, x_{k+1}]$. Interpret the result. 

(b) Develop the formula obtained in (a) into a composite quadrature rule, with 
remainder term, for the integral $\int_a^b f(x)\,dx$. 


(a) Given a function g(x, y) on the unit squareO < x < 1,0 < y <1, 
determine a “bilinear polynomial” p(x, vy) = a + bx + cy + dxy such 
that p has the same values as g at the four corners of the square. 

(b) Use (a) to obtain a cubature formula for pe i, g(x, y)dxdy that involves 

the values of g at the four corners of the unit square. What rule does this 

reduce to if g is a function of x only (i.e., does not depend on y)? 

Use (b) to find a “composite cubature rule” for is ri g(x, y)dxdy involv- 

ing the values g;; = g(ih, jh), i, 7 =0,1,...,n, where h = 1/n. 

Let d\(h) = (f(A) — f(0))/h, h > 0, be the difference quotient of f 

at the origin. Describe how the extrapolation method based on a suitable 

expansion of d\(h) can be used to approximate f’(0) to successively higher 
accuracy. 

(b) Develop a similar method for calculating f” (0), based on d2(h) = [f(A)— 
2f(0) + f(-h)]/h?. 


(c 


wm 


(a 


wm 
1.


Let $f(x) = 1/(1 - \pi x)$ and $f_i = f(ih)$, $i = -2, -1, 0, 1, 2$. In terms of the four
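backward differences

$$\nabla f_1 = f_1 - f_0, \qquad \nabla^2 f_1 = f_1 - 2f_0 + f_{-1},$$
$$\nabla^3 f_2 = f_2 - 3f_1 + 3f_0 - f_{-1}, \qquad \nabla^4 f_2 = f_2 - 4f_1 + 6f_0 - 4f_{-1} + f_{-2},$$

define

$$e_n(h) = f^{(n)}(0) - \frac{1}{h^n} \nabla^n f_{\lfloor (n+1)/2 \rfloor}, \quad n = 1, 2, 3, 4.$$

Try to determine the order of convergence of $e_n(h)$ as $h \to 0$ by printing, for
$n = 1, \ldots, 4$,

$$e_n(h_k) \quad \text{and} \quad r_k := \frac{e_n(h_k)}{e_n(h_{k-1})},$$

where $h_k = \frac14 \cdot 2^{-k}$, $k \ge 0$. Comment on the results.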
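The ratio test described above can be sketched in a few lines of Matlab. The test function `exp(x)` is a stand-in chosen only for illustration (all of its derivatives at 0 equal 1); it is an assumption, not the function of the assignment. The variable names are likewise hypothetical.

```matlab
% Sketch: error ratios of the four backward-difference approximations.
% f = exp is an illustrative stand-in, so that f^(n)(0) = 1 for every n.
f = @(x) exp(x);
K = 4;                                   % steps h_k = 2^(-k)/4, k = 0,...,K
e = zeros(4, K+1);
for k = 0:K
    h   = 2^(-k)/4;
    fm2 = f(-2*h); fm1 = f(-h); f0 = f(0); f1 = f(h); f2 = f(2*h);
    d = [f1 - f0; ...
         f1 - 2*f0 + fm1; ...
         f2 - 3*f1 + 3*f0 - fm1; ...
         f2 - 4*f1 + 6*f0 - 4*fm1 + fm2];
    e(:, k+1) = 1 - d ./ (h.^(1:4))';    % e_n(h_k), n = 1,...,4
end
r = e(:, 2:end) ./ e(:, 1:end-1);        % r_k = e_n(h_k)/e_n(h_{k-1})
disp(r)
```

A ratio near 1/2 indicates first-order convergence and a ratio near 1/4 second-order convergence, since the step is halved from one column to the next.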
2. Let

$$f(z) = \tan z, \quad |z| < \tfrac12 \pi.$$

(a) Express

$$y_m(\theta) = \operatorname{Re}\{e^{-im\theta} f(x_0 + re^{i\theta})\}, \quad 0 < r < \tfrac12 \pi, \quad m = 1, 2, 3, \ldots$$

explicitly as a function of $\theta$. {Hint: use Euler's identities $\sin z = (e^{iz} - e^{-iz})/(2i)$,
$\cos z = (e^{iz} + e^{-iz})/2$, valid for arbitrary complex $z$.}

(b) Obtain the analogue to (3.19) for the $m$th derivative and thus write $f^{(m)}(x_0)$
as a definite integral over $[0, 2\pi]$.
(c) Use Matlab to compute $f^{(m)}(0)$ for m = 1:5 using the integral in (b) in
conjunction with the composite trapezoidal rule (cf. Sect. 3.2.1) relative to a
subdivision of $[0, 2\pi]$ into $n$ subintervals. Use $r = \kappa\pi/12$, $\kappa = 1, 2, 3, 4, 5$,
and n = 5:5:50. For each $m$ print a table whose columns contain $r$,
$n$, the trapezoidal approximation, and the (absolute) error, in this
order. Comment on the results; in particular, try to explain the convergence
behavior as $r$ increases and the difference in behavior for $n$ even and $n$
odd. {Hint: prepare plots of the integrand; you may use the Matlab routine
spline for cubic spline interpolation to do this.}

(d) Do the same as (c), but for $f^{(m)}(\tfrac{7}{16}\pi)$ and $r = \tfrac{1}{32}\pi$.

(e) Write and run a Matlab program for approximating $f^{(m)}(0)$, m = 1:5, by
central difference formulae with steps $h = 1, \tfrac15, \tfrac1{25}, \tfrac1{125}, \tfrac1{625}$. Comment on
the results.

(f) Do the same as (e), but for $f^{(m)}(\tfrac{7}{16}\pi)$ and $h = \tfrac{1}{32}\pi, \tfrac{1}{160}\pi, \tfrac{1}{800}\pi, \tfrac{1}{4000}\pi$.
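As a rough illustration of parts (b) and (c), the following sketch assumes the standard Cauchy-integral representation $f^{(m)}(x_0) = \frac{m!}{2\pi r^m}\int_0^{2\pi} \operatorname{Re}\{e^{-im\theta} f(x_0 + re^{i\theta})\}\,d\theta$ (which may differ in form from the book's analogue of (3.19)) and applies the composite trapezoidal rule, which for a periodic integrand is simply an equally weighted sum. The choices `r = pi/4` and `n = 40` are illustrative assumptions.

```matlab
% Sketch: derivatives of tan at x0 = 0 via the trapezoidal rule on a circle.
f  = @(z) tan(z);
x0 = 0;  r = pi/4;  n = 40;              % illustrative radius and node count
theta = 2*pi*(0:n-1)/n;                  % equally spaced angles on [0, 2*pi)
fz = f(x0 + r*exp(1i*theta));            % f on the circle of radius r
d  = zeros(1, 5);
for m = 1:5
    ym   = real(exp(-1i*m*theta) .* fz); % integrand y_m(theta)
    d(m) = factorial(m)/(n*r^m) * sum(ym);
end
disp(d)                                  % exact values are 1, 0, 2, 0, 16
```

Repeating the computation for several values of `r` and `n` gives the tables requested in (c).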


3. Given $n$ distinct real nodes $x_k = x_k^{(n)}$, the interpolatory quadrature rule

$$(\mathrm{WNC}_n) \qquad \int_a^b f(x) w(x)\,dx = \sum_{k=1}^n w_k^{(n)} f(x_k^{(n)}), \quad \text{all } f \in \mathbb{P}_{n-1},$$

is called a weighted (by the weight function $w$) Newton-Cotes formula
(cf. Sect. 3.2.2). The weights $w_k^{(n)}$ can be generated by $n_g$-point Gauss
integration, $n_g = \lfloor (n + 1)/2 \rfloor$, of the elementary Lagrange interpolation
polynomials (see (3.39)),

$$(W_n) \qquad w_k^{(n)} = \int_a^b \ell_k(x) w(x)\,dx, \quad \ell_k(x) = \prod_{\substack{\ell = 1 \\ \ell \ne k}}^n \frac{x - x_\ell}{x_k - x_\ell}.$$

This is implemented in the OPQ routine NewtCotes.m downloadable from the
web site mentioned in MA 4. For reasons of economy, it uses the barycentric
form (see Chap. 2, Sect. 2.2.5, (2.106)) of the Lagrange polynomials and the
algorithm (ibid., (2.108)) to compute the auxiliary quantities $\lambda_k^{(n)}$. Use the
routine NewtCotes.m to explore the positivity of $(\mathrm{WNC}_n)$, i.e., $w_k^{(n)} > 0$,
$k = 1, 2, \ldots, n$.

(a) Write a Matlab function y=posNC(n,ab,ab0,eps0), which checks
the positivity of the $n$-point Newton-Cotes formula with the abscissae
being the zeros of the Jacobi polynomial $P_n^{(\alpha,\beta)}$ with parameters $\alpha$, $\beta$,
and using integration relative to the Jacobi weight function $w = w^{(\alpha_0,\beta_0)}$
with other parameters $\alpha_0$, $\beta_0$. The former are selected by the OPQ routine
ab=r_jacobi(n,alpha,beta), from which the Jacobi abscissae can
be obtained via the OPQ function xw=gauss(n,ab) as the first column
of the n x 2 array xw. The weight function $w$ is provided by the routine
ab0=r_jacobi(floor((n+1)/2),alpha0,beta0), which allows
us to generate the required $n_g$-point Gaussian quadrature rule by the routine
xw=gauss(ng,ab0). The input parameter eps0, needed in the routine
NewtCotes.m, is a number close to, but larger than, the machine precision
eps, for example $\varepsilon_0 = 0.5 \times 10^{-14}$. Arrange the output parameter y to
have the value 1 if all $n$ weights of the Newton-Cotes formula $(\mathrm{WNC}_n)$ are
positive, and the value 0 otherwise.

Use your routine for all $n \le N = 50$, $\alpha_0 = \beta_0 = 0, -1/2, 1/2$, and
$\alpha = -1+h : h : \alpha^+$, $\beta = \alpha : h : \beta^+$, where $\alpha^+ = \beta^+ = 3, 1.5, 4$ and
$h = 0.05, 0.025, 0.05$ for the three values of $\alpha_0$, $\beta_0$, respectively. Prepare
plots in which a red plus sign is placed at the point $(\alpha, \beta)$ of the $(\alpha, \beta)$-plane
if positivity holds for all $n \le N$, and a blue dot otherwise. Explain
why it suffices to consider only $\beta \ge \alpha$. {Hint: use the reflection formula
$P_n^{(\beta,\alpha)}(x) = (-1)^n P_n^{(\alpha,\beta)}(-x)$ for Jacobi polynomials.} In a second set of
plots show the exact upper boundary of the positivity domain created in the
first plots; compute it by a bisection-type method (cf. Chap. 4, Sect. 4.3.1).
(Running the programs for $N = 50$ may take a while. You may want to
experiment with smaller values of $N$ to see how the positivity domains vary.)

(b) The plots in (a) suggest that $n$-point Newton-Cotes formulae are positive for
all $n \le N = 50$ on the line $0 < \beta = \alpha$ up to a point $\alpha = \alpha_{\max}$. Use the
same bisection-type method as in (a) to determine $\alpha_{\max}$ for the three values
of $\alpha_0$, $\beta_0$ and for $N = 20, 50, 100$ in each case.

(c) Repeat (a) with $\alpha = \beta = 0, -1/2, 1/2$ (Gauss-Legendre and Chebyshev
abscissae of the first and second kinds) and $\alpha_0 = -0.95 : 0.05 : \alpha_0^+$,
$\beta_0 = \alpha_0 : 0.05 : \beta_0^+$, where $\alpha_0^+ = \beta_0^+ = 3.5, 3, 4$.

(d) Repeat (a) with $\alpha^+ = \beta^+ = 6, 4, 6$, but for weighted $(n + 2)$-point Newton-Cotes
formulae that contain as nodes the points $\pm 1$ in addition to the $n$ Jacobi
abscissae.

(e) The plots in (d) suggest that the closed $(n + 2)$-point Newton-Cotes formulae
are positive for all $n \le N = 50$ on some line $\alpha_{\min} < \alpha = \beta < \alpha_{\max}$.
Determine $\alpha_{\min}$ and $\alpha_{\max}$ similarly as $\alpha_{\max}$ in (b).

(f) Repeat (c) for the weighted closed $(n + 2)$-point Newton-Cotes formula of
(d) with $\alpha_0 = -1+h : h : \alpha_0^+$, $\beta_0 = \alpha_0 : h : \beta_0^+$, where $\alpha_0^+ = \beta_0^+ =
0.5, -0.2, 1$ and $h = 0.01, 0.01, 0.02$ for the three values of $\alpha$, $\beta$.


4. Below are a number of suggestions as to how the following integrals may be
computed,

$$I_c = \int_0^1 \frac{\cos x}{\sqrt{x}}\,dx, \qquad I_s = \int_0^1 \frac{\sin x}{\sqrt{x}}\,dx.$$

(a) Use the composite trapezoidal rule with $n$ intervals of equal length $h = 1/n$,
"ignoring" the singularity at $x = 0$ (i.e., arbitrarily using zero as the value
of the integrand at $x = 0$).

(b) Use the composite trapezoidal rule over the interval $[h, 1]$ with $n - 1$ intervals
of length $h = 1/n$ in combination with a weighted Newton-Cotes rule with
weight function $w(x) = x^{-1/2}$ over the interval $[0, h]$. {Adapt the formula
(3.43) to the interval $[0, h]$.}

(c) Make the change of variables $x = t^2$ and apply the composite trapezoidal
rule to the resulting integrals.

(d) Use Gauss-Legendre quadrature on the integrals obtained in (c).

(e) Use Gauss-Jacobi quadrature with parameters $\alpha = 0$ and $\beta = -\tfrac12$ directly
on the integrals $I_c$ and $I_s$.

{As a point of information, $I_c = \sqrt{2\pi}\, C\big(\sqrt{2/\pi}\big) = 1.809048475800\ldots$,
$I_s = \sqrt{2\pi}\, S\big(\sqrt{2/\pi}\big) = 0.620536603446\ldots$, where $C(x)$, $S(x)$ are the Fresnel
integrals.}

Implement and run the proposed methods for n = 100:100:1000 in (a) and
(b), for n = 20:20:200 in (c), and for n = 1:10 in (d) and (e). Try to explain
the results you obtain. {To get the required subroutines for Gaussian quadrature,
download the OPQ routines r_jacobi.m and gauss.m from the web
site http://www.cs.purdue.edu/archives/2002/wxg/codes/OPQ.html.}
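Suggestion (c) can be sketched as follows. The substitution $x = t^2$ gives $I_c = 2\int_0^1 \cos(t^2)\,dt$ and $I_s = 2\int_0^1 \sin(t^2)\,dt$, which have smooth integrands. The value `n = 200` and the variable names are illustrative assumptions.

```matlab
% Sketch of method (c): change of variables x = t^2, then the composite
% trapezoidal rule on the smooth integrands 2*cos(t^2) and 2*sin(t^2).
n = 200;  h = 1/n;
t = (0:n)*h;
w = h*[0.5, ones(1, n-1), 0.5];          % composite trapezoidal weights
Ic = sum(w .* 2 .* cos(t.^2));
Is = sum(w .* 2 .* sin(t.^2));
fprintf('%.10f  %.10f\n', Ic, Is)        % compare with the values quoted above
```

Because the singularity has been removed, the error behaves like $O(h^2)$, which can be checked by varying `n`.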

5. For a natural number $p$ let

$$I_p = \int_0^1 (1 - t)^p f(t)\,dt$$

be (except for the factor $1/p!$) the $p$th iterated integral of $f$; cf. (3.50). Compare
the composite trapezoidal rule based on $n$ subintervals with the $n$-point Gauss-Jacobi
rule on $[0, 1]$ with parameters $\alpha = p$ and $\beta = 0$. Take, for example,
$f(t) = \tan t$ and p = 5:5:20, and let n = 10:10:50 in the case of
the trapezoidal rule, and n = 2:2:10 for the Gauss rule. {See MA 4 for
instructions on how to download routines for generating the Gaussian quadrature
rules.}

6. (a) Let $h_k = (b - a)/2^k$, $k = 0, 1, 2, \ldots$. Denote by

$$T_{h_k}(f) = h_k \Big[ \tfrac12 f(a) + \sum_{r=1}^{2^k - 1} f(a + r h_k) + \tfrac12 f(b) \Big]$$

the composite trapezoidal rule and by

$$M_{h_k}(f) = h_k \sum_{r=1}^{2^k} f\big(a + (r - \tfrac12) h_k\big)$$

the composite midpoint rule, both relative to a subdivision of $[a, b]$ into $2^k$
subintervals. Show that the first column $T_{k,0}$ of the Romberg array $\{T_{k,m}\}$
can be generated recursively as follows:

$$T_{0,0} = \frac{b - a}{2} [f(a) + f(b)],$$

$$T_{k+1,0} = \tfrac12 [T_{k,0} + M_{h_k}(f)], \quad k = 0, 1, 2, \ldots.$$

(b) Write a Matlab function for computing $\int_a^b f(x)\,dx$ by the Romberg integration
scheme, with $h_k = (b - a)/2^k$, $k = 0, 1, \ldots, n - 1$.

Formal parameters: a, b, n; include f as a subfunction.

Output variable: the n x n Romberg array T.

Order of computation: Generate T row by row; generate the trapezoidal sums
recursively as in part (a).

Program size: Keep it down to about 20 lines of Matlab code.

Output: $T_{k,0}$, $T_{k,k}$, $k = 0, 1, \ldots, n - 1$.

(c) Call your subroutine (with $n = 10$) to approximate the following integrals.

1. $\int_1^2 \frac{e^x}{x}\,dx$ ("exponential integral")
2. $\int_0^1 \frac{\sin x}{x}\,dx$ ("sine integral")
3. $\frac{1}{\pi} \int_0^\pi \cos(yx)\,dx$, $y = 1.7$
4. $\frac{1}{\pi} \int_0^\pi \cos(y \sin x)\,dx$, $y = 1.7$
5. $\int_0^1 \sqrt{1 - x^2}\,dx$
0 
6. i * (ite. f=


2 
a F(x)dx, f(x) = 


(d) Comment on the behavior of the Romberg scheme in each of the seven cases 
in part (c). 
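The recursion of part (a), combined with the usual Romberg extrapolation $T_{k,m} = T_{k,m-1} + (T_{k,m-1} - T_{k-1,m-1})/(4^m - 1)$, can be sketched as below. This is a script rather than the function requested in (b), and the integrand `exp(x)` on $[0, 1]$ with `n = 6` is an illustrative assumption, not one of the seven test integrals.

```matlab
% Sketch of the Romberg scheme; row k+1, column m+1 of T stores T_{k,m}.
f = @(x) exp(x);  a = 0;  b = 1;  n = 6;  % illustrative integrand and size
T = zeros(n, n);
T(1,1) = (b - a)/2 * (f(a) + f(b));       % T_{0,0}
for k = 1:n-1
    hk = (b - a)/2^(k-1);                 % step of the previous row
    xm = a + ((1:2^(k-1)) - 0.5)*hk;      % midpoints of the previous subdivision
    M  = hk*sum(f(xm));                   % composite midpoint rule
    T(k+1,1) = (T(k,1) + M)/2;            % trapezoidal sum with halved step
    for m = 1:k
        T(k+1,m+1) = T(k+1,m) + (T(k+1,m) - T(k,m))/(4^m - 1);
    end
end
disp(diag(T))                             % diagonal entries T_{k,k}
```

Each new row reuses the previous trapezoidal sum, so the function values on the coarser grid are never recomputed.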
3. (a) The $k$th derivative of $\prod_{i=1}^n (x - x_i)$ is a sum of products, each containing
$n - k$ factors $x - x_i$. Thus, if $x = x_0$, each term of the sum is $O(H^{n-k})$,
hence also the sum itself.

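(b) By Lagrange interpolation, we have

$$f(x) = p_n(f; x) + R_n(x),$$

where $p_n(f; x) = p_n(f; x_0, x_1, \ldots, x_n; x)$, in Newton's form, is given by

$$p_n(f; x) = f_0 + (x - x_0)[x_0, x_1]f + \cdots + \prod_{i=0}^{n-1} (x - x_i) \cdot [x_0, x_1, \ldots, x_n]f$$

and

$$R_n(x) = \prod_{i=0}^{n} (x - x_i) \, \frac{f^{(n+1)}(\xi(x))}{(n+1)!},$$

assuming $f \in C^{n+1}$ in the span $I$ of $x_0, x_1, \ldots, x_n, x$. Differentiating $n$
times at $x = x_0$ gives

$$(*) \qquad f^{(n)}(x_0) = \frac{d^n}{dx^n} p_n(f; x) \Big|_{x = x_0} + R_n^{(n)}(x_0).$$

Clearly,

$$(**) \qquad \frac{d^n}{dx^n} p_n(f; x) \Big|_{x = x_0} = n! \, [x_0, x_1, \ldots, x_n]f.$$

Assuming that $f \in C^{2n+1}(I)$, we can apply Leibniz's rule of differentiating
$n$ times the product

$$R_n(x) = (x - x_0) \cdot \Big\{ \prod_{i=1}^{n} (x - x_i) \, \frac{f^{(n+1)}(\xi(x))}{(n+1)!} \Big\}.$$

If we evaluate the result at $x = x_0$, we get

$$R_n^{(n)}(x_0) = n \, \frac{d^{n-1}}{dx^{n-1}} \Big\{ \prod_{i=1}^{n} (x - x_i) \, \frac{f^{(n+1)}(\xi(x))}{(n+1)!} \Big\} \Big|_{x = x_0}.$$

Using Leibniz's rule again, and the result of (a), we obtain

$$R_n^{(n)}(x_0) = n \, \frac{d^{n-1}}{dx^{n-1}} \prod_{i=1}^{n} (x - x_i) \Big|_{x = x_0} \cdot \frac{f^{(n+1)}(\xi_0)}{(n+1)!} + O(H^2),$$

where $\xi_0 = \xi(x_0)$. Since

$$\prod_{i=1}^{n} (x - x_i) = x^n - \sigma_1 x^{n-1} + \cdots, \quad \sigma_1 = \sum_{i=1}^{n} x_i,$$

we have

$$\frac{d^{n-1}}{dx^{n-1}} \prod_{i=1}^{n} (x - x_i) \Big|_{x = x_0} = n! \, x_0 - (n-1)! \, \sigma_1 = (n-1)! \, (n x_0 - \sigma_1),$$

so that

$$(***) \qquad R_n^{(n)}(x_0) = n! \, (n x_0 - \sigma_1) \, \frac{f^{(n+1)}(\xi_0)}{(n+1)!} + O(H^2).$$

Since

$$n x_0 - \sigma_1 = \sum_{i=1}^{n} (x_0 - x_i) = O(H) \quad \text{if } n x_0 \ne \sigma_1,$$

the assertion follows by combining (*)-(***).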
One has

$$[x_0, x_1, \ldots, x_n]f = \frac{1}{n! \, h^n} \Delta^n f_0 = \frac{1}{n! \, h^n} \nabla^n f_n.$$

This is proved by induction on $n$. We show it for the forward difference
only; for the backward difference the proof is analogous. Since the claim is
obviously true for $n = 1$, suppose it is true for some $n$. Then by definition
of divided differences, and the induction hypothesis,

$$[x_0, x_1, \ldots, x_n, x_{n+1}]f = \frac{[x_1, \ldots, x_{n+1}]f - [x_0, \ldots, x_n]f}{x_{n+1} - x_0}
= \frac{1}{n! \, h^n} \, \frac{\Delta^n f_1 - \Delta^n f_0}{(n+1)h}
= \frac{1}{(n+1)! \, h^{n+1}} \Delta^{n+1} f_0,$$

which is the claim for $n + 1$. Therefore, since $\frac{1}{n} \sum_{i=1}^{n} x_i = x_0 + (n+1)h/2 \ne x_0$, we get

$$f^{(n)}(x_0) = \frac{1}{h^n} \Delta^n f_0 + O(h) = \frac{1}{h^n} \nabla^n f_n + O(h).$$
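The first-order accuracy of the one-sided difference formula can be observed numerically. The sketch below uses the illustrative function `exp(x)` with `n = 3` and `h = 0.01` (all assumptions made for the example) and relies on the built-in `diff` to form the $n$th forward difference.

```matlab
% Sketch: Delta^n f_0 / h^n approximates f^(n)(x0) with an O(h) error.
f  = @(x) exp(x);  x0 = 0;  n = 3;  h = 1e-2;   % illustrative choices
fk  = f(x0 + (0:n)*h);       % values f_0, ..., f_n on the equally spaced grid
dn  = diff(fk, n);           % n-th forward difference Delta^n f_0 (a scalar)
app = dn/h^n;                % approximation of f^(n)(x0)
err = app - f(x0);           % exact derivative is exp(x0) for this f
fprintf('approx = %.6f, err/h = %.4f\n', app, err/h)
```

For this test function the ratio `err/h` stays close to $n/2$, consistent with the factor $n x_0 - \sigma_1 = -n(n+1)h/2$ in the remainder.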


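14. With a slight difference in notation, one has from (2.129) of Chap. 2 that

$$h B_0(x) = x_1 - x, \quad x_0 \le x \le x_1;$$

$$h B_k(x) = \begin{cases} x - x_{k-1} & \text{if } x_{k-1} \le x \le x_k, \\ x_{k+1} - x & \text{if } x_k \le x \le x_{k+1}, \end{cases} \quad k = 1, 2, \ldots, N-1;$$

$$h B_N(x) = x - x_{N-1}, \quad x_{N-1} \le x \le x_N.$$

From this, one gets, with obvious changes of variables,

$$\int_{x_0}^{x_1} B_0(x) e^{-imx}\,dx + \int_{x_{N-1}}^{x_N} B_N(x) e^{-imx}\,dx
= h \int_0^1 (1 - t) e^{-imth}\,dt + h \int_0^1 (1 - t) e^{-im(2\pi - th)}\,dt
= 2h \int_0^1 (1 - t) \cos(mth)\,dt,$$

and, for $k = 1, 2, \ldots, N-1$,

$$\int_{x_{k-1}}^{x_{k+1}} B_k(x) e^{-imx}\,dx
= \frac{1}{h} \int_{x_{k-1}}^{x_k} (x - x_{k-1}) e^{-imx}\,dx + \frac{1}{h} \int_{x_k}^{x_{k+1}} (x_{k+1} - x) e^{-imx}\,dx$$

$$= h \int_0^1 (1 - t) e^{-im(x_k - th)}\,dt + h \int_0^1 (1 - t) e^{-im(x_k + th)}\,dt
= 2h \, e^{-imx_k} \int_0^1 (1 - t) \cos(mth)\,dt.$$

Since (cf. Chap. 2, (2.132)), with $f_k = f(x_k)$,

$$s_1(f; x) = f_0 B_0(x) + \sum_{k=1}^{N-1} f_k B_k(x) + f_N B_N(x),$$

and $f_N = f_0$, we get

$$\frac{1}{2\pi} \int_0^{2\pi} s_1(f; x) e^{-imx}\,dx
= \frac{1}{2\pi} f_0 \Big( \int_{x_0}^{x_1} B_0(x) e^{-imx}\,dx + \int_{x_{N-1}}^{x_N} B_N(x) e^{-imx}\,dx \Big)
+ \frac{1}{2\pi} \sum_{k=1}^{N-1} f_k \int_{x_{k-1}}^{x_{k+1}} B_k(x) e^{-imx}\,dx$$

$$= \frac{h}{2\pi} \sum_{k=0}^{N-1} f_k e^{-imx_k} \cdot 2 \int_0^1 (1 - t) \cos(mth)\,dt
= \tau_m \cdot \frac{1}{N} \sum_{k=0}^{N-1} f_k e^{-imx_k},$$

where

$$\tau_m = 2 \int_0^1 (1 - t) \cos(mth)\,dt.$$

Integration by parts yields

$$\tau_m = \frac{2}{(mh)^2} (1 - \cos mh) = \frac{\sin^2(\tfrac12 mh)}{(\tfrac12 mh)^2},$$

hence, since $h = 2\pi/N$,

$$\tau_m = \Big( \frac{\sin(m\pi/N)}{m\pi/N} \Big)^2.$$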
The factor $\tau_m$, which modifies the composite trapezoidal sum, may be interpreted
as an "attenuation factor," since as $m \to \infty$ it tends to zero, whereas
the composite trapezoidal sums, being periodic in $m$ with period $N$, cycle
through $N$ values. For a theory of attenuation factors in Fourier analysis, see
Gautschi [1971/1972].
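The identity derived above lends itself to a quick numerical check. The sketch below compares a brute-force Fourier coefficient of the piecewise linear interpolant with $\tau_m$ times the discrete sum; the periodic test function `exp(cos(x))`, the values `N = 8`, `m = 3`, and the fine-grid resolution are all illustrative assumptions.

```matlab
% Sketch: Fourier coefficient of the linear spline interpolant versus
% attenuation factor times the composite trapezoidal (discrete Fourier) sum.
N = 8;  m = 3;                            % illustrative choices
f  = @(x) exp(cos(x));                    % 2*pi-periodic test function
xk = 2*pi*(0:N)/N;  fk = f(xk);           % nodes x_0,...,x_N and data
tau = (sin(m*pi/N)/(m*pi/N))^2;           % attenuation factor tau_m
dft = sum(fk(1:N) .* exp(-1i*m*xk(1:N)))/N;
x  = linspace(0, 2*pi, N*2000 + 1);       % fine grid containing the nodes
s1 = interp1(xk, fk, x);                  % piecewise linear interpolant s_1
c  = trapz(x, s1 .* exp(-1i*m*x))/(2*pi); % brute-force Fourier coefficient
disp(abs(c - tau*dft))
```

The two quantities agree up to the quadrature error of the fine-grid trapezoidal rule.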

22. (a) Since the Hermite interpolant reproduces $f$ exactly if $f \in \mathbb{P}_{n+2}$, the formula
has degree of exactness $d = n + 2$.

(b) There are $2n + 3$ free parameters available to make the formula exact for
the first $2n + 3$ powers $x^r$, $r = 0, 1, \ldots, 2n + 2$. Thus, the maximum degree
of exactness is expected to be $d = 2n + 2$.


(c) Letting $\omega_n(x) = \prod_{k=1}^n (x - x_k)$, we have, for any $p \in \mathbb{P}_{n-1}$,

$$\int_0^1 \omega_n(x) p(x) x^2 (1 - x)\,dx = 0,$$

since the integrand is a polynomial of degree $\le 2n + 2$, and hence the
integral is equal to the quadrature sum. The latter, however, vanishes, since
the integrand vanishes together with its first derivative at $x = 0$ and also
vanishes at $x = 1$, and $\omega_n(x_k) = 0$ for $k = 1, 2, \ldots, n$. This shows that
$\omega_n(\cdot) = \pi_n(\cdot\,; x^2(1 - x)\,dx)$, the Jacobi polynomial relative to the interval
$[0, 1]$, with parameters $\alpha = 1$, $\beta = 2$.

(d) Let $f(x) \in \mathbb{P}_{2n+2}$. Divide $f(x)$ by $x^2(1 - x)\omega_n(x)$:

$$f(x) = x^2(1 - x)\omega_n(x) q(x) + r(x), \quad q \in \mathbb{P}_{n-1}, \quad r \in \mathbb{P}_{n+2}.$$

Then

$$\int_0^1 f(x)\,dx = \int_0^1 \omega_n(x) q(x) x^2 (1 - x)\,dx + \int_0^1 r(x)\,dx.$$

By the orthogonality assumption and the fact that $q \in \mathbb{P}_{n-1}$, the first integral
vanishes. For the second integral, since $r \in \mathbb{P}_{n+2}$ and the quadrature formula
is Hermite interpolatory, we have

$$\int_0^1 r(x)\,dx = a_0 r(0) + a_1 r'(0) + \sum_{k=1}^n w_k r(x_k) + b_0 r(1),$$

and using

$$r(0) = f(0), \quad r'(0) = f'(0), \quad r(1) = f(1),$$
$$r(x_k) = f(x_k) - x_k^2 (1 - x_k) \omega_n(x_k) q(x_k) = f(x_k),$$

since again, $\omega_n(x_k) = 0$, we get

$$\int_0^1 f(x)\,dx = a_0 f(0) + a_1 f'(0) + \sum_{k=1}^n w_k f(x_k) + b_0 f(1).$$

This shows that the formula has degree of exactness $2n + 2$.
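54. (a) Putting

$$E(f) = \int_{-1}^1 f(x)\,dx - \tfrac13 [f(-1) + 4f(0) + f(1)],$$

one has, if $f \in C^2[-1, 1]$, the Peano representation

$$E(f) = \int_{-1}^1 K_1(t) f''(t)\,dt,$$

where, for $-1 \le t \le 1$,

$$K_1(t) = E_{(x)}((x - t)_+)
= \int_t^1 (x - t)\,dx - \tfrac13 [(-1 - t)_+ + 4(-t)_+ + (1 - t)_+]$$

$$= \tfrac12 (1 - t)^2 - \tfrac13 (1 - t) - \tfrac43 \begin{cases} 0 & \text{if } t \ge 0, \\ -t & \text{if } t < 0 \end{cases}$$

$$= \begin{cases} \tfrac16 (1 - t)(1 - 3t) & \text{if } t \ge 0, \\ \tfrac16 (1 + t)(1 + 3t) & \text{if } t < 0. \end{cases}$$

It can be seen that $K_1(-t) = K_1(t)$ and $K_1(\tfrac13) = K_1(1) = 0$ (see figure on
the next page). Therefore, $|E(f)| \le \|f''\|_\infty \int_{-1}^1 |K_1(t)|\,dt$.
Now,

$$\int_{-1}^1 |K_1(t)|\,dt = 2 \int_0^1 |K_1(t)|\,dt = 2 \Big( \int_0^{1/3} K_1(t)\,dt - \int_{1/3}^1 K_1(t)\,dt \Big)$$

$$= \tfrac13 \Big( \int_0^{1/3} (1 - t)(1 - 3t)\,dt + \int_{1/3}^1 (1 - t)(3t - 1)\,dt \Big)$$

$$= \tfrac13 \Big( \int_0^{1/3} (1 - 4t + 3t^2)\,dt + \int_{1/3}^1 (-1 + 4t - 3t^2)\,dt \Big)$$

$$= \tfrac13 \Big( \tfrac13 - \tfrac29 + \tfrac1{27} + \tfrac13 - \tfrac29 + \tfrac1{27} \Big) = \tfrac{8}{81}.$$

Thus,

$$|E(f)| \le \tfrac{8}{81} \|f''\|_\infty.$$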
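The constant $8/81$ can be confirmed numerically. The sketch below encodes the even kernel from the two-case formula above as a single expression in $|t|$ and integrates its absolute value with `trapz`; the grid size is an arbitrary illustrative choice.

```matlab
% Sketch: numerical check of int_{-1}^{1} |K_1(t)| dt = 8/81.
K1 = @(t) (1 - abs(t)).*(1 - 3*abs(t))/6;  % K_1 is even in t
t  = linspace(-1, 1, 200001);              % fine grid, illustrative size
I  = trapz(t, abs(K1(t)));
fprintf('%.8f  %.8f\n', I, 8/81)
```

The sign change of $K_1$ at $t = \pm\tfrac13$ is what prevents a simpler mean-value form of the remainder with $f''$.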
[Figure: graph of the Peano kernel $K_1(t)$ for $-1 \le t \le 1$]


(b) With the substitution $t = c + xh$, and applying (a), one obtains

$$\int_{c-h}^{c+h} g(t)\,dt = h \int_{-1}^1 g(c + xh)\,dx
= \frac{h}{3} [g(c - h) + 4g(c) + g(c + h)] + E_h(g),$$

so that

$$|E_h(g)| \le \frac{8}{81} \, h \max_{-1 \le x \le 1} \Big| \frac{d^2}{dx^2} g(c + xh) \Big|.$$

Since $\frac{d^2}{dx^2} g(c + xh) = h^2 g''(c + xh)$, one gets

$$|E_h(g)| \le \frac{8}{81} \, h^3 \|g''\|_{\infty[c-h,c+h]}.$$


(c) The Peano kernel $K_1(t)$ in this case is (negative) definite, namely

$$K_1(t) = \begin{cases} \tfrac12 t(1 + t) & \text{if } -1 \le t \le 0, \\ -\tfrac12 t(1 - t) & \text{if } 0 \le t \le 1, \end{cases}$$

giving

$$E(f) = f''(\tau) \int_{-1}^1 K_1(t)\,dt = -\tfrac16 f''(\tau),$$

hence

$$|E(f)| \le \tfrac16 \|f''\|_\infty.$$

This is only a little worse than the bound in (a).

56. (a) An easy calculation shows that $E(f) = 0$ if $f(x) \equiv 1$ and $f(x) \equiv x$. For
$f(x) = x^2$, one obtains $E(f) = -\tfrac16 (1 - \varepsilon)(1 - 2\varepsilon) < 0$ if $0 < \varepsilon < \tfrac12$. It
follows that $d = 1$.

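(b) With

$$E(f) = \int_0^1 f(x)\,dx - \frac{1}{2\varepsilon} \int_0^\varepsilon f(x)\,dx - \frac{1}{2\varepsilon} \int_{1-\varepsilon}^1 f(x)\,dx,$$

one has

$$K_1(t) = E_{(x)}((x - t)_+) = \int_t^1 (x - t)\,dx - \frac{1}{2\varepsilon} \int_0^\varepsilon (x - t)_+\,dx - \frac{1}{2\varepsilon} \int_{1-\varepsilon}^1 (x - t)_+\,dx.$$

Now

$$\int_t^1 (x - t)\,dx = \tfrac12 (1 - t)^2,$$

$$\int_0^\varepsilon (x - t)_+\,dx = \begin{cases} 0 & \text{if } t \ge \varepsilon, \\ \int_t^\varepsilon (x - t)\,dx & \text{if } 0 \le t \le \varepsilon \end{cases}
= \begin{cases} 0, \\ \tfrac12 (\varepsilon - t)^2, \end{cases}$$

$$\int_{1-\varepsilon}^1 (x - t)_+\,dx = \begin{cases} \int_t^1 (x - t)\,dx & \text{if } t \ge 1 - \varepsilon, \\ \int_{1-\varepsilon}^1 (x - t)\,dx & \text{if } t \le 1 - \varepsilon \end{cases}
= \begin{cases} \tfrac12 (1 - t)^2, \\ \tfrac12 (1 - t)^2 - \tfrac12 (1 - \varepsilon - t)^2 = \varepsilon (1 - t - \tfrac12 \varepsilon). \end{cases}$$

Therefore, if $0 \le t \le \varepsilon$, then

$$K_1(t) = \tfrac12 (1 - t)^2 - \frac{1}{2\varepsilon} \cdot \tfrac12 (\varepsilon - t)^2 - \frac{1}{2\varepsilon} \, \varepsilon \big( 1 - t - \tfrac12 \varepsilon \big),$$

which, after some algebra, reduces to

$$(1) \qquad K_1(t) = \tfrac12 t^2 \Big( 1 - \frac{1}{2\varepsilon} \Big), \quad 0 \le t \le \varepsilon.$$

If $\varepsilon \le t \le 1 - \varepsilon$, then

$$K_1(t) = \tfrac12 (1 - t)^2 - \frac{1}{2\varepsilon} \, \varepsilon \big( 1 - t - \tfrac12 \varepsilon \big),$$

that is,

$$(2) \qquad K_1(t) = -\tfrac12 t(1 - t) + \tfrac14 \varepsilon, \quad \varepsilon \le t \le 1 - \varepsilon.$$

Finally, if $t \ge 1 - \varepsilon$, then

$$K_1(t) = \tfrac12 (1 - t)^2 - \frac{1}{2\varepsilon} \cdot \tfrac12 (1 - t)^2,$$

that is,

$$(3) \qquad K_1(t) = \tfrac12 (1 - t)^2 \Big( 1 - \frac{1}{2\varepsilon} \Big), \quad 1 - \varepsilon \le t \le 1.$$

Therefore,

$$(4) \qquad E(f) = \int_0^1 K_1(t) f''(t)\,dt,$$

with $K_1$ as above in (1)-(3).

(c) Since $1 - \frac{1}{2\varepsilon} < 0$ when $0 < \varepsilon < \tfrac12$, it follows from (1) and (3) that

$$K_1(t) \le 0 \quad \text{if } 0 \le t \le \varepsilon \text{ or } 1 - \varepsilon \le t \le 1.$$

For $\varepsilon \le t \le 1 - \varepsilon$, one has from (2) that

$$K_1(t) = -\tfrac12 t(1 - t) + \tfrac14 \varepsilon \le -\tfrac12 \varepsilon (1 - \varepsilon) + \tfrac14 \varepsilon = -\tfrac14 \varepsilon (1 - 2\varepsilon) < 0,$$

since $0 < \varepsilon < \tfrac12$. Altogether, therefore,

$$K_1(t) \le 0 \quad \text{for } 0 \le t \le 1,$$

and $K_1$ is negative definite. Consequently,

$$E(f) = e_2 f''(\tau), \quad 0 < \tau < 1,$$

with

$$e_2 = E\Big( \frac{x^2}{2} \Big) = -\tfrac{1}{12} (1 - 2\varepsilon)(1 - \varepsilon),$$

as follows by an elementary calculation.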
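The three-piece kernel (1)-(3) and the value of $e_2$ can be checked numerically, since $e_2 = E(x^2/2) = \int_0^1 K_1(t)\,dt$. The sketch below assumes the illustrative value $\varepsilon = 0.3$ and an arbitrary grid size.

```matlab
% Sketch: build K_1 from formulas (1)-(3), verify K_1 <= 0 and
% int_0^1 K_1(t) dt = e_2 = -(1 - 2*ep)*(1 - ep)/12.
ep = 0.3;                                  % illustrative value in (0, 1/2)
t  = linspace(0, 1, 10001);
c  = 1 - 1/(2*ep);
i1 = t <= ep;  i3 = t >= 1 - ep;  i2 = ~i1 & ~i3;
K  = zeros(size(t));
K(i1) = 0.5*t(i1).^2*c;                    % formula (1)
K(i2) = -0.5*t(i2).*(1 - t(i2)) + ep/4;    % formula (2)
K(i3) = 0.5*(1 - t(i3)).^2*c;              % formula (3)
I  = trapz(t, K);
e2 = -(1 - 2*ep)*(1 - ep)/12;
fprintf('max K = %.2e, integral = %.8f, e2 = %.8f\n', max(K), I, e2)
```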
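(d) The limit case $\varepsilon \downarrow 0$ gives us the known results for the (ordinary) trapezoidal
formula. If $\varepsilon \to \tfrac12$, then $e_2 \to 0$, that is, $E(f) \to 0$. This is consistent with
the fact that, for $\varepsilon \to \tfrac12$,

$$\frac{1}{2\varepsilon} \Big[ \int_0^\varepsilon f(x)\,dx + \int_{1-\varepsilon}^1 f(x)\,dx \Big] \to \int_0^1 f(x)\,dx.$$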


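60. (a) The remainder vanishes for the first three powers $1, x, x^2$ of $x$ if

$$\alpha = 1, \qquad \alpha x_1 + \beta = \tfrac12, \qquad \alpha x_1^2 + \beta = \tfrac13.$$

Eliminating $\beta$ from the last two equations gives the quadratic equation
$x_1^2 - x_1 + \tfrac16 = 0$, which has two solutions,

$$x_1 = \tfrac12 \Big( 1 \pm \frac{1}{\sqrt3} \Big),$$

both located in $(0, 1)$. Thus,

$$\alpha = 1, \qquad x_1 = \tfrac12 \Big( 1 \pm \frac{1}{\sqrt3} \Big), \qquad \beta = \mp \frac{1}{2\sqrt3}.$$

With these values one gets, since $x_1 = \tfrac12 - \beta$, $x_1^2 = \tfrac13 - \beta$, and $\beta^2 = \tfrac1{12}$,

$$E(x^3) = \tfrac14 - x_1^3 - \beta = \tfrac14 - \big( \tfrac12 - \beta \big) \big( \tfrac13 - \beta \big) - \beta = -\tfrac16 \beta = \pm \frac{1}{12\sqrt3} \ne 0,$$

so the maximum attainable degree of exactness is $d = 2$.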


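(b) Each of the two quadrature formulae obtained in (a) (having three nodes and
degree of exactness 2) is interpolatory. Therefore, if $f \in C^3[0, 1]$, then

$$E(f) = \int_0^1 [f(x) - p_2(f; 0, x_1, 1; x)]\,dx = \int_0^1 x(x - x_1)(x - 1) \, \frac{f^{(3)}(\xi(x))}{3!}\,dx.$$

There follows

$$|E(f)| \le \frac{\gamma}{6} \|f^{(3)}\|_\infty, \quad \text{where } \gamma = \int_0^1 x(1 - x)|x - x_1|\,dx.$$

The numerical factor $\gamma$ can be written as

$$\gamma = \int_0^{x_1} x(1 - x)(x_1 - x)\,dx + \int_{x_1}^1 x(1 - x)(x - x_1)\,dx,$$

where the first integral evaluates to

$$\tfrac16 x_1^3 - \tfrac1{12} x_1^4,$$

and the second to

$$\tfrac1{12} - \tfrac16 x_1 + \tfrac16 x_1^3 - \tfrac1{12} x_1^4.$$

Thus,

$$\gamma = \tfrac1{12} - \tfrac16 x_1 + \tfrac13 x_1^3 - \tfrac16 x_1^4.$$

Using repeatedly the equation $x_1^2 = x_1 - \tfrac16$, one can eliminate all higher
powers of $x_1$: one finds $x_1^3 = \tfrac56 x_1 - \tfrac16$ and $x_1^4 = \tfrac23 x_1 - \tfrac5{36}$, so that

$$\gamma = \tfrac1{12} - \tfrac16 x_1 + \tfrac13 \big( \tfrac56 x_1 - \tfrac16 \big) - \tfrac16 \big( \tfrac23 x_1 - \tfrac5{36} \big)
= \tfrac1{12} - \tfrac1{18} + \tfrac5{216} + \big( -\tfrac16 + \tfrac5{18} - \tfrac19 \big) x_1
= \tfrac{11}{216}.$$

The value of $\gamma$, being independent of $x_1$, holds for both choices of $x_1$. Thus,

$$|E(f)| \le \frac{11}{1296} \|f^{(3)}\|_\infty.$$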
(c) By the work in (a),

$$\int_c^{c+h} f(t)\,dt = h \int_0^1 f(c + xh)\,dx
= h \{ f(c + x_1 h) + \beta [f(c + h) - f(c)] \} + E_h(f),$$

with $\beta$ and $x_1$ as determined in (a). From the result in (b),

$$|E_h(f)| \le \frac{11}{1296} \, h^4 \|f^{(3)}\|_{\infty[c,c+h]}.$$


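(d) Letting $h = (b - a)/n$, $t_k = a + kh$, $f_k = f(t_k)$, $k = 0, 1, 2, \ldots, n$, we
have

$$\int_a^b f(t)\,dt = \sum_{k=0}^{n-1} \int_{t_k}^{t_{k+1}} f(t)\,dt$$

$$= h [f(t_0 + x_1 h) + \beta (f_1 - f_0) + f(t_1 + x_1 h) + \beta (f_2 - f_1) + \cdots
+ f(t_{n-1} + x_1 h) + \beta (f_n - f_{n-1})] + \sum_{k=0}^{n-1} E_h(f) \big|_{c = t_k}$$

$$= h \Big\{ \sum_{k=0}^{n-1} f(t_k + x_1 h) + \beta [f(b) - f(a)] \Big\} + E_n(f),$$

where, by the result in (c),

$$|E_n(f)| \le \sum_{k=0}^{n-1} \big| E_h(f) |_{c = t_k} \big| \le \frac{11}{1296} \, h^4 \sum_{k=0}^{n-1} \|f^{(3)}\|_{\infty[t_k,t_{k+1}]}
\le \frac{11}{1296} \, n h^4 \|f^{(3)}\|_{\infty[a,b]}
= \frac{11}{1296} \, (b - a) h^3 \|f^{(3)}\|_{\infty[a,b]}.$$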
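The composite rule and its error bound can be tried out on a simple example. The sketch below assumes the illustrative integrand `exp(t)` on $[0, 1]$ with `n = 10` subintervals, uses the node $x_1 = \tfrac12(1 + 1/\sqrt3)$, and takes $\|f^{(3)}\|_\infty = e$ for this integrand.

```matlab
% Sketch: composite rule h*{ sum f(t_k + x1*h) + beta*[f(b) - f(a)] }
% and the error bound (11/1296)*(b - a)*h^3*max|f'''|.
f = @(t) exp(t);  a = 0;  b = 1;  n = 10;   % illustrative problem
x1  = (1 + 1/sqrt(3))/2;                    % one of the two admissible nodes
bet = 1/2 - x1;                             % beta = 1/2 - x1 from part (a)
h   = (b - a)/n;
tk  = a + (0:n-1)*h;                        % left endpoints t_0,...,t_{n-1}
Q   = h*(sum(f(tk + x1*h)) + bet*(f(b) - f(a)));
err   = abs(Q - (exp(1) - 1));
bound = 11/1296*(b - a)*h^3*exp(1);         % max|f'''| = e on [0,1]
fprintf('error = %.3e, bound = %.3e\n', err, bound)
```

Only one interior evaluation per subinterval is needed, because the endpoint terms telescope into the single correction $\beta[f(b) - f(a)]$.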


63. (a) If $p(x, y) = g(x, y)$ is to hold at the four corner points of the square, we
must have

$$a = g(0, 0),$$
$$a + b = g(1, 0),$$
$$a + c = g(0, 1),$$
$$a + b + c + d = g(1, 1).$$

Solving for $a$, $b$, $c$, and $d$ gives

$$a = g(0, 0),$$
$$b = g(1, 0) - g(0, 0),$$
$$c = g(0, 1) - g(0, 0),$$
$$d = g(1, 1) - g(1, 0) - g(0, 1) + g(0, 0).$$


(b) If instead of $g$ we integrate $p$ (as determined in (a)), we get the
approximation

$$\int_0^1 \int_0^1 g(x, y)\,dx\,dy \approx a \int_0^1 \int_0^1 dx\,dy + b \int_0^1 dy \int_0^1 x\,dx
+ c \int_0^1 dx \int_0^1 y\,dy + d \int_0^1 x\,dx \int_0^1 y\,dy = a + \tfrac12 b + \tfrac12 c + \tfrac14 d,$$

hence, substituting from (a),

$$\int_0^1 \int_0^1 g(x, y)\,dx\,dy \approx g(0, 0) + \tfrac12 [g(1, 0) - g(0, 0)]
+ \tfrac12 [g(0, 1) - g(0, 0)] + \tfrac14 [g(1, 1) - g(1, 0) - g(0, 1) + g(0, 0)],$$

that is,

$$\int_0^1 \int_0^1 g(x, y)\,dx\,dy \approx \tfrac14 [g(0, 0) + g(1, 0) + g(0, 1) + g(1, 1)].$$

If $g(x, y) = g(x)$, this reduces to the trapezoidal rule.
(c) The grid square

$$Q_{ij} = \{(x, y) : ih \le x \le (i + 1)h, \; jh \le y \le (j + 1)h\}$$

is mapped by

$$x = h(i + u), \qquad y = h(j + v)$$

onto the unit square $0 \le u \le 1$, $0 \le v \le 1$. Since

$$\frac{\partial(x, y)}{\partial(u, v)} = \begin{vmatrix} h & 0 \\ 0 & h \end{vmatrix} = h^2,$$

we get

$$\iint_{Q_{ij}} g(x, y)\,dx\,dy = \int_0^1 \int_0^1 g(h(i + u), h(j + v)) \cdot h^2\,du\,dv
\approx \frac{h^2}{4} [g_{i,j} + g_{i+1,j} + g_{i,j+1} + g_{i+1,j+1}].$$

Summing over all grid squares gives

$$\int_0^1 \int_0^1 g(x, y)\,dx\,dy \approx h^2 \Big\{ \sum_{(i,j) \in S} g_{ij} + \tfrac12 \sum_{(i,j) \in B} g_{ij} + \tfrac14 \sum_{(i,j) \in C} g_{ij} \Big\},$$

where $S$, $B$, $C$ denote, respectively, the sets of interior grid points, interior
boundary points, and (four) corner points.
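The composite cubature rule amounts to a tensor product of one-dimensional trapezoidal weights. The sketch below builds the weight pattern (1 in the interior, 1/2 on the edges, 1/4 at the corners) as an outer product; the bilinear test function and `n = 20` are illustrative assumptions, chosen because a bilinear function is integrated exactly.

```matlab
% Sketch: composite cubature rule on the unit square with h = 1/n.
n = 20;  h = 1/n;                         % illustrative grid size
[X, Y] = meshgrid((0:n)*h);               % grid points (ih, jh)
w = ones(n+1, 1);  w([1 end]) = 1/2;      % 1D trapezoidal weights (without h)
W = w*w';                                 % 1 interior, 1/2 edges, 1/4 corners
g = @(x, y) x.*y + 2*x - y + 3;           % bilinear test function
Q = h^2*sum(sum(W .* g(X, Y)));
fprintf('%.12f\n', Q)                     % exact integral is 3.75
```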
3. (a) PROGRAMS

%MAIII_3A  Boundary of positivity domain
%
N=50;
eps0=.5e-14;
[abound,bbound]=posdomainNC(N);
ab0=r_jacobi(floor((N+1)/2));
%ab0=r_jacobi(floor((N+1)/2),-1/2);
%ab0=r_jacobi(floor((N+1)/2),1/2);
ib=find(bbound-abound);
for i=1:size(ib,1)
  ap(i)=abound(i);
  bhigh=bbound(i); blow=abound(i);
  while bhigh-blow>.5e-5
    bm=(bhigh+blow)/2;
    ab=r_jacobi(N,abound(i),bm);
    y=1;
    for n=1:N
      pos=posNC(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      bhigh=bm;
    else
      blow=bm;
    end
  end
  bp(i)=bm;
end
figure
hold on
plot(ap,bp)
axis('square')
plot([-1 3],[-1 3])
%plot([-1 1.5],[-1 1.5])
%plot([-1 4],[-1 4])
hold off


%POSDOMAINNC
%
% Positivity domain for Newton-Cotes formulae with
% Jacobi abscissae and integration relative to the
% weight function 1 and Chebyshev weight functions
% of the first and second kind.
%
function [abound,bbound]=posdomainNC(N)
hold on
abound=zeros(80,1); bbound=zeros(80,1);
%abound=zeros(100,1); bbound=zeros(100,1);
%abound=zeros(100,1); bbound=zeros(100,1);
eps0=.5e-14; i=0;
for a=-.95:.05:3
%for a=-.975:.025:1.5
%for a=-.95:.05:4
  i=i+1; abound(i)=a; k=0;
  for b=a:.05:3
% for b=a:.025:1.5
% for b=a:.05:4
    ab=r_jacobi(N,a,b);
    ab0=r_jacobi(floor((N+1)/2));
%   ab0=r_jacobi(floor((N+1)/2),-1/2);
%   ab0=r_jacobi(floor((N+1)/2),1/2);
    y=1;
    for n=1:N
      pos=posNC(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      plot(a,b,':')
      axis('square')
      k=k+1;
      if k==1, bbound(i)=b; end
    else
      plot(a,b,'r+')
    end
  end
end
hold off


%POSNC
%
% Positivity of the n-point Newton-Cotes formula
% with abscissae equal to the zeros of the orthogonal
% polynomial associated with the nx2 recurrence matrix
% ab and integration being with respect to the measure
% identified by the floor((n+1)/2)x2 recurrence matrix
% ab0. The input parameter eps0, required in the routine
% NewtCotes.m, is a number larger than, but close to,
% the machine precision.
%
function y=posNC(n,ab,ab0,eps0)
y=0;
xw=gauss(n,ab); zn=zeros(n,1);
w=NewtCotes(n,xw(:,1),ab0,eps0);
if w>zn, y=1; end


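Why does $\beta \ge \alpha$ suffice? To simplify notation, let

$$p_n(x) = P_n^{(\alpha,\beta)}(x), \quad w(x) = (1 - x)^\alpha (1 + x)^\beta;$$
$$P_n(x) = P_n^{(\beta,\alpha)}(x), \quad W(x) = (1 - x)^\beta (1 + x)^\alpha.$$

If $x_\nu$ denote the zeros of $p_n$, and $X_\nu$ those of $P_n$, one has $X_\nu = -x_\nu$ by the
reflection formula for Jacobi polynomials (see Hint). With

$$\ell_\nu(x) = \frac{p_n(x)}{(x - x_\nu) p_n'(x_\nu)}, \qquad L_\nu(x) = \frac{P_n(x)}{(x - X_\nu) P_n'(X_\nu)},$$

there follows

$$W_\nu = \int_{-1}^1 L_\nu(x) W(x)\,dx = \int_{-1}^1 \frac{P_n(x)}{(x - X_\nu) P_n'(X_\nu)} W(x)\,dx$$

$$= \int_{-1}^1 \frac{(-1)^n p_n(-x)}{(x + x_\nu)(-1)^{n+1} p_n'(x_\nu)} (1 - x)^\beta (1 + x)^\alpha\,dx$$

$$= -\int_{-1}^1 \frac{p_n(-x)}{(x + x_\nu) p_n'(x_\nu)} (1 - x)^\beta (1 + x)^\alpha\,dx$$

$$= -\int_{-1}^1 \frac{p_n(t)}{(-t + x_\nu) p_n'(x_\nu)} (1 + t)^\beta (1 - t)^\alpha\,dt$$

$$= \int_{-1}^1 \frac{p_n(t)}{(t - x_\nu) p_n'(x_\nu)} w(t)\,dt = w_\nu,$$

showing that the two Newton-Cotes weights are the same.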
For relevant literature, see Askey and Fitch [1968], Askey [1972], [1979],
Micchelli [1980], Sottas [1982], [1988], [1989], and Gautschi [2011a,
Sect. 4.1].

In the case of constant weight function = 1, it appears that with increasing
$N$ the slanted portion of the upper boundary of the positivity domain slightly
turns downward, and the horizontal portion slowly moves down. It is likely
that in the limit $N \to \infty$ the slope of the slanted portion tends to 1 and
the height of the horizontal portion to 3/2, which would be consistent with a
conjecture of Askey (1979) (except for the slanted portion of the boundary).
On the line $\beta = 1.55$, for example, spotchecking with $\alpha = -0.4, 0 : 0.5 : 1.5$
revealed nonpositivity of the $n$-point Newton-Cotes formula when $n = 800$
(but not when $n = 700$).

In the case of the Chebyshev weight function of the first kind, the height of
the upper boundary is practically constant equal to 1/2 (already for $N = 10$
and more so for larger values of $N$). This is in agreement with a result proved
by Micchelli (1980) in the case of Gegenbauer abscissae, $\alpha = \beta$.

In the case of the Chebyshev weight function of the second kind, the slope
of the slanted portion of the boundary curve, as in the first case, seems to tend
(slowly) to 1, and the height of the horizontal portion to 2.5. (Cf. also the third
column in the output to MAIII_3B, which seems to confirm this, given the
slowness of convergence as $N \to \infty$.)

PLOTS (N=50) for 3(a)

[Figure: positivity domains and their upper boundaries in the (alpha, beta)-plane, in three rows of panels: Weight function = 1; Chebyshev weight function of the first kind; Chebyshev weight function of the second kind]

(b) PROGRAM

%MAIII_3B  Upper bound of Gegenbauer positivity
%          interval
%
f0='%8.0f %11.6f\n';
disp('       N   alpha_max')
eps0=.5e-14;
ab0=r_jacobi(50);
%ab0=r_jacobi(50,-1/2);
%ab0=r_jacobi(50,1/2);
for N=[20 50 100]
  ahigh=2; alow=1.5;
% ahigh=0.6; alow=0.4;
% ahigh=3.3; alow=2.6;
  while ahigh-alow>.5e-6
    a=(ahigh+alow)/2;
    ab=r_jacobi(N,a);
    y=1;
    for n=1:N
      pos=posNC(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      ahigh=a;
    else
      alow=a;
    end
  end
  fprintf(f0,N,a)
end


OUTPUT

>> MAIII_3B

       weight function = 1    Chebyshev #1    Chebyshev #2
   N   alpha_max              alpha_max       alpha_max
  20   1.700560               0.500000        2.863848
  50   1.643718               0.500000        2.750502
 100   1.617770               0.500000        2.700881
>>
(c) PROGRAMS

%MAIII_3C  Boundary of positivity domain
N=50;
eps0=.5e-14;
[a0bound,b0bound]=posdomainNC0(N);
ab=r_jacobi(N);
%ab=r_jacobi(N,-1/2);
%ab=r_jacobi(N,1/2);
ib0=find(b0bound-a0bound);
for i=ib0(1):ib0(1)+size(ib0,1)-1
  a0p(i)=a0bound(i);
  b0high=b0bound(i); b0low=a0bound(i);
  while b0high-b0low>.5e-5
    b0m=(b0high+b0low)/2;
    ab0=r_jacobi(floor((N+1)/2),a0bound(i),b0m);
    y=1;
    for n=1:N
      pos=posNC(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      b0high=b0m;
    else
      b0low=b0m;
    end
  end
  b0p(i)=b0m;
end
figure
hold on
plot([a0p(ib0(1)) a0p(ib0(1))],[a0p(ib0(1)) b0p(ib0(1))])
axis('square')
plot(a0p(ib0(1):size(a0p,2)),b0p(ib0(1):size(b0p,2)))
plot([-1 3.5],[-1 3.5])
%plot([-1 3],[-1 3])
%plot([-1 4],[-1 4])
hold off


%POSDOMAINNC0
%
%Positivity domain for Newton-Cotes
%formulae with Gauss-Legendre and Chebyshev abscissae
%and integration relative to Jacobi weight functions.
%
function [a0bound,b0bound]=posdomainNC0(N)
hold on
a0bound=zeros(90,1); b0bound=zeros(90,1);
%a0bound=zeros(80,1); b0bound=zeros(80,1);
%a0bound=zeros(100,1); b0bound=zeros(100,1);
eps0=.5e-14; i=0;
ab=r_jacobi(N);
%ab=r_jacobi(N,-1/2);
%ab=r_jacobi(N,1/2);
for a0=-.95:.05:3.5
%for a0=-.95:.05:3
%for a0=-.95:.05:4
  i=i+1; a0bound(i)=a0; k=0;
  for b0=a0:.05:3.5
% for b0=a0:.05:3
% for b0=a0:.05:4
    ab0=r_jacobi(floor((N+1)/2),a0,b0);
    y=1;
    for n=1:N
      pos=posNC(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      plot(a0,b0,':')
      axis('square')
      k=k+1;
      if k==1, b0bound(i)=b0; end
    else
      plot(a0,b0,'r+')
    end
  end
end
hold off

PLOTS (N=50) for 3(c)

[Figure: positivity domains and their upper boundaries in the (alpha_0, beta_0)-plane, in three rows of panels: Gauss-Legendre abscissae; Chebyshev abscissae of the first kind; Chebyshev abscissae of the second kind]

(d) PROGRAMS

%MAIII_3D  Boundary of positivity domain
N=50;
eps0=.5e-14;
[abound,bbound]=posdomainNC1(N);
ab0=r_jacobi(floor((N+3)/2));
%ab0=r_jacobi(floor((N+3)/2),-1/2);
%ab0=r_jacobi(floor((N+3)/2),1/2);
ib=find(bbound-abound);
for i=ib(1):size(ib,1)
  ap(i-ib(1)+1)=abound(i);
  bhigh=bbound(i); blow=abound(i);
  while bhigh-blow>.5e-5
    bm=(bhigh+blow)/2;
    ab=r_jacobi(N,abound(i),bm);
    y=1;
    for n=1:N
      pos=posNC1(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      bhigh=bm;
    else
      blow=bm;
    end
  end
  bp(i-ib(1)+1)=bm;
end
figure
hold on
plot(ap,bp)
axis('square')
plot([0 0],[0 bp(1)])
bpf=bp(size(ib,1)-ib(1)+1);
plot([ap(size(ib,1)-ib(1)+1),bpf],[bpf,bpf])
plot([-1 6],[-1 6])
%plot([-1 4],[-1 4])
%plot([-1 6],[-1 6])
hold off


%POSDOMAINNC1
%
% Positivity domain for closed Newton-Cotes formulae
% with Jacobi abscissae and integration relative to
% the weight function 1 and Chebyshev weight functions
% of the first and second kind.
%
function [abound,bbound]=posdomainNC1(N)
hold on
abound=zeros(140,1); bbound=zeros(140,1);
%abound=zeros(100,1); bbound=zeros(100,1);
%abound=zeros(140,1); bbound=zeros(140,1);
eps0=.5e-14; i=0;
for a=-.95:.05:6
%for a=-.95:.05:4
%for a=-.95:.05:6
  i=i+1; abound(i)=a; k=0;
  for b=a:.05:6
% for b=a:.05:4
% for b=a:.05:6
    ab=r_jacobi(N,a,b);
    ab0=r_jacobi(floor((N+3)/2));
%   ab0=r_jacobi(floor((N+3)/2),-1/2);
%   ab0=r_jacobi(floor((N+3)/2),1/2);
    y=1;
    for n=1:N
      pos=posNC1(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      plot(a,b,':')
      axis('square')
      k=k+1;
      if k==1, bbound(i)=b; end
    else
      plot(a,b,'r+')
    end
  end
end
hold off
%POSNC1
%
% Positivity of the (n+2)-point Newton-Cotes formula
% with abscissae equal to +-1 and the n zeros of the
% orthogonal polynomial associated with the nx2
% recurrence matrix ab and integration being with
% respect to the measure identified by the floor
% ((n+3)/2)x2 recurrence matrix ab0. The input
% parameter eps0, required in the routine
% NewtCotes.m, is a number larger than, but close to,
% the machine precision.
%
function y=posNC1(n,ab,ab0,eps0)
y=0;
xw=gauss(n,ab); zn=zeros(n+2,1);
xw1=[-1;xw(:,1);1];
w1=NewtCotes(n+2,xw1,ab0,eps0);
if w1>zn, y=1; end


For relevant literature, see Notaris [2002], [2003]. 
(e) PROGRAM


%MAIII_3E  Bounds for the Gegenbauer
%          positivity intervals
%
f0='%8.0f %11.6f\n';
disp('       N   alpha_max')
%disp('       N   alpha_min')
eps0=.5e-14;
ab0=r_jacobi(51);
%ab0=r_jacobi(51,-1/2);
%ab0=r_jacobi(51,1/2);
for N=[20 50 100]
  ahigh=4.5; alow=3.5;
% ahigh=3; alow=2.5;
% ahigh=5.3; alow=4.5;
% ahigh=.1; alow=-.1;
% ahigh=-.4; alow=-.6;
% ahigh=.6; alow=.4;
  while ahigh-alow>.5e-6
    a=(ahigh+alow)/2;
    ab=r_jacobi(N,a);
    y=1;
    for n=1:N
      pos=posNC1(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      ahigh=a;
%     alow=a;
    else
      alow=a;
%     ahigh=a;
    end
  end
  fprintf(f0,N,a)
end


[PLOTS (N=50) for 3(d): Weight function = 1; Chebyshev weight function of the first kind; Chebyshev weight function of the second kind]


In the case of constant weight function = 1 and Gegenbauer abscissae,
one has positivity of the closed Newton-Cotes formulae when $0 < \alpha = \beta < 3.5$
by a result of Kütz (cf. Notaris [2002, p. 144]). This is consistent with
our plot on the previous page and the output of MAIII_3E below. Also the
positivity domain proved in Notaris [2002, Theorem 2.1(a),(b)] is indeed a
small subdomain of the respective domain in our plot.


OUTPUT


>> MAIII_3E
          weight function = 1        Chebyshev #1            Chebyshev #2
     N      alpha_min/max            alpha_min/max           alpha_min/max
    20   0.000000   4.012030     -0.500000   2.863848     0.500000   5.152099
    50   0.000000   3.841891     -0.500000   2.750502     0.500000   4.924714
   100   0.000000   3.769501     -0.500000   2.700881     0.500000   4.829984
>>


(f) PROGRAMS


%MAIII_3F  Boundary of positivity domain
%
N=50;
eps0=.5e-14;
[a0bound,b0bound]=posdomainNC01(N);
ab=r_jacobi(N);
%ab=r_jacobi(N,-1/2);
%ab=r_jacobi(N,1/2);
ib0=find(b0bound-a0bound);
for i=ib0(1):ib0(1)+size(ib0,1)-1
  a0p(i)=a0bound(i);
  b0high=b0bound(i); b0low=a0bound(i);
  while b0high-b0low>.5e-5
    b0m=(b0high+b0low)/2;
    ab0=r_jacobi(floor((N+3)/2),a0bound(i),b0m);
    y=1;
    for n=1:N
      pos=posNC1(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      b0high=b0m;
    else
      b0low=b0m;
    end
  end
  b0p(i)=b0m;
end
figure
hold on
plot(a0p(ib0(1):size(a0p,2)),b0p(ib0(1):size(b0p,2)))
axis('square')
plot([-1 .5],[-1 .5])
%plot([-1 -.2],[-1 -.2])
%plot([-1 1],[-1 1])
hold off


[PLOTS (N=50) for 3(f): Gauss-Legendre abscissae; Chebyshev abscissae of the first kind; Chebyshev abscissae of the second kind]


%POSDOMAINNC01
%
% Positivity domain for closed Newton-Cotes formulae with
% Gauss-Legendre and Chebyshev abscissae and integration
% relative to Jacobi weight functions.
%
function [a0bound,b0bound]=posdomainNC01(N)
hold on
a0bound=zeros(150,1); b0bound=zeros(150,1);
%a0bound=zeros(80,1); b0bound=zeros(80,1);
%a0bound=zeros(100,1); b0bound=zeros(100,1);
eps0=.5e-14; i=0;
ab=r_jacobi(N);
%ab=r_jacobi(N,-1/2);
%ab=r_jacobi(N,1/2);
for a0=-.99:.01:.5
%for a0=-.99:.01:-.2
%for a0=-.98:.02:1
  i=i+1; a0bound(i)=a0; k=0;
  for b0=a0:.01:.5
% for b0=a0:.01:-.2
% for b0=a0:.02:1
    ab0=r_jacobi(floor((N+3)/2),a0,b0);
    y=1;
    for n=1:N
      pos=posNC1(n,ab,ab0,eps0);
      if pos==0
        y=0;
        break
      end
    end
    if y==0
      plot(a0,b0,':')
      axis('square')
      k=k+1;
      if k==1, b0bound(i)=b0; end
    else
      plot(a0,b0,'r+')
    end
  end
end
hold off
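6.(a) It suffices to derive the second relation:

\[
\begin{aligned}
T_{k+1,0} &= h_{k+1}\Bigl[\tfrac12 f(a) + \tfrac12 f(b) + \sum_{\ell=1}^{2^{k+1}-1} f(a+\ell h_{k+1})\Bigr] \\
&= \tfrac12 h_k \cdot \tfrac12\,[f(a)+f(b)]
   + \tfrac12 h_k \sum_{\substack{\ell=1 \\ (\ell\ \mathrm{even})}}^{2^{k+1}-1} f(a+\ell h_{k+1})
   + \tfrac12 h_k \sum_{\substack{\ell=1 \\ (\ell\ \mathrm{odd})}}^{2^{k+1}-1} f(a+\ell h_{k+1}) \\
&= \tfrac12 h_k \Bigl[\tfrac12 f(a) + \sum_{j=1}^{2^k-1} f(a+2j h_{k+1}) + \tfrac12 f(b)\Bigr]
   + \tfrac12 h_k \sum_{j=1}^{2^k} f\bigl(a+(2j-1) h_{k+1}\bigr) \\
&= \tfrac12 h_k \Bigl[\tfrac12 f(a) + \sum_{j=1}^{2^k-1} f(a+j h_k) + \tfrac12 f(b)\Bigr]
   + \tfrac12 h_k \sum_{j=1}^{2^k} f\bigl(a+(j-\tfrac12) h_k\bigr) \\
&= \tfrac12 \bigl(T_{k,0} + M_{h_k}(f)\bigr).
\end{aligned}
\]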
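
The relation just derived can be checked numerically. The following is a minimal sketch in Python (not the text's Matlab); the helper names `trapezoid` and `midpoint` are hypothetical, and the integrand exp(x)/x on [1,2] is borrowed from the program for parts (b)-(d). It compares the trapezoidal value on 2^(k+1) subintervals with the average of the trapezoidal and midpoint values on 2^k subintervals.

```python
import math

def trapezoid(f, a, b, k):
    """Composite trapezoidal rule on 2**k subintervals, h_k = (b-a)/2**k."""
    n = 2 ** k
    h = (b - a) / n
    inner = sum(f(a + j * h) for j in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

def midpoint(f, a, b, k):
    """Composite midpoint rule on the same 2**k subintervals."""
    n = 2 ** k
    h = (b - a) / n
    return h * sum(f(a + (j - 0.5) * h) for j in range(1, n + 1))

def f(x):
    return math.exp(x) / x

a, b = 1.0, 2.0
for k in range(6):
    lhs = trapezoid(f, a, b, k + 1)                       # T_{k+1,0}
    rhs = 0.5 * (trapezoid(f, a, b, k) + midpoint(f, a, b, k))
    print(k, lhs, abs(lhs - rhs))                         # difference at rounding level
```

The practical point of the relation is that refining the trapezoidal rule only requires function values at the new (midpoint) nodes, which is what the first column of a Romberg array exploits.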


(b)-(d) PROGRAM


%MAIII_6BC
%
f0='%6.0f %19.15f %19.15f\n';
disp('     i        T(i,1)              T(i,i)')
n=10;
a=1; b=2;
%a=0; b=1;
%a=0; b=pi;
%a=0; b=1;
%a=0; b=2;
T=romberg(a,b,n);
for i=1:n
  fprintf(f0,i,T(i,1),T(i,i))
end

function T=romberg(a,b,n)
T=zeros(n,n);
h=b-a; m=1; T(1,1)=h*(f(a)+f(b))/2;
for i=2:n
  h=h/2; m=2*m; mm1=m-1;
  k=(1:2:mm1)';
  T(i,1)=T(i-1,1)/2+h*sum(f(a+h*k));
  l=1;
  for k=2:i
    l=4*l;
    T(i,k)=T(i,k-1)+(T(i,k-1)-T(i-1,k-1))/(l-1);
  end
end

function y=f(x)
y=exp(x)./x;
%y=1;
%if x~=zeros(size(x,1),1)
%  y=sin(x)./x;
%end
%y=cos(1.7*x)/pi;
%y=cos(1.7*sin(x))/pi;
%y=sqrt(1-x.^2);
%y=zeros(size(x,1),1);
%for i=1:size(x,1)
%  if x(i)<=sqrt(2)
%    y(i)=x(i);
%  else
%    y(i)=sqrt(2)*(2-x(i))/(2-sqrt(2));
%  end
%end
%y=zeros(size(x,1),1);
%for i=1:size(x,1)
%  if x(i)<=3/4
%    y(i)=x(i);
%  else
%    y(i)=3*(2-x(i))/5;
%  end
%end


OUTPUT


>> MAIII_6BC
     i        T(i,1)              T(i,i)            (i)  Ei(2)-Ei(1)
     1   3.206404938962185   3.206404938962185
     2   3.097098826260448   3.060663455359868
     3   3.068704101194839   3.059144242004954
     4   3.061519689433579   3.059116836818692
     5   3.059717728013521   3.059116541002761
     6   3.059266861956402   3.059116539648306
     7   3.059154121802282   3.059116539645955
     8   3.059125935283745   3.059116539645953
     9   3.059118888561571   3.059116539645952
    10   3.059117126875243   3.059116539645953
>>

>> MAIII_6BC
     i        T(i,1)              T(i,i)            (ii)  Si(1)
     1   0.920735492403948   0.920735492403948
     2   0.939793284806177   0.946145882273587
     3   0.944513521665390   0.946083004063674
     4   0.945690863582701   0.946083070387223
     5   0.945985029934386   0.946083070367181
     6   0.946058560962768   0.946083070367183
     7   0.946076943060063   0.946083070367183
     8   0.946081538543152   0.946083070367183
     9   0.946082687411347   0.946083070367183
    10   0.946082974628235   0.946083070367183
>>

>> MAIII_6BC
     i        T(i,1)              T(i,i)            (iii)  sin(1.7*pi)/(1.7*pi)
     1   0.793892626146237   0.793892626146237
     2  -0.048556949021066  -0.329373474076833
     3  -0.128279145629103  -0.143218526971000
     4  -0.145813060666924  -0.151575238486816
     5  -0.150072135492423  -0.151480973833174
     6  -0.151129454951574  -0.151481239834844
     7  -0.151393324105836  -0.151481239647168
     8  -0.151459262675203  -0.151481239647201
     9  -0.151475745523767  -0.151481239647201
    10  -0.151479866123815  -0.151481239647201
>>

>> MAIII_6BC
     i        T(i,1)              T(i,i)            (iv)  J_0(1.7)
     1   1.000000000000000   1.000000000000000
     2   0.435577752852238   0.247437003802984
     3   0.397997329127638   0.394672755713868
     4   0.397984859446116   0.398880460382129
     5   0.397984859446110   0.397968405925570
     6   0.397984859446110   0.397984921711269
     7   0.397984859446110   0.397984859403064
     8   0.397984859446110   0.397984859446102
     9   0.397984859446109   0.397984859446109
    10   0.397984859446110   0.397984859446110
>>

>> MAIII_6BC
     i        T(i,1)              T(i,i)            (v)  pi/4
     1   0.500000000000000   0.500000000000000
     2   0.683012701892219   0.744016935856292
     3   0.748927267025610   0.772690912262104
     4   0.772454786089293   0.781054541057592
     5   0.780813259456935   0.783876545840612
     6   0.783775605719283   0.784861687334472
     7   0.784824228194921   0.785208669629317
     8   0.785195198099154   0.785331191417285
     9   0.785326395739308   0.785374488842346
    10   0.785372788179914   0.785389793759148
>>

>> MAIII_6BC
     i        T(i,1)              T(i,i)            (vi)  sqrt(2)
     1   0.000000000000000   0.000000000000000
     2   1.000000000000000   1.333333333333333
     3   1.353553390593274   1.480609266621545
     4   1.390165042944955   1.396451590456853
     5   1.408470869120796   1.415741434555130
     6   1.412654727760896   1.413984394417376
     7   1.413896991372851   1.414335276603458
     8   1.414109407799875   1.414168137287926
     9   1.414211586644613   1.414251685496006
    10   1.414212593986806   1.414209909989944
>>

>> MAIII_6BC
     i        T(i,1)              T(i,i)            (vii)  3/4
     1   0.000000000000000   0.000000000000000
     2   0.600000000000000   0.800000000000000
     3   0.700000000000000   0.728888888888889
     4   0.750000000000000   0.769523809523810
     5   0.750000000000000   0.748489262371615
     6   0.750000000000000   0.750024807644598
     7   0.750000000000000   0.749999901904365
     8   0.750000000000000   0.750000000096088
     9   0.750000000000000   0.749999999999976
    10   0.750000000000000   0.750000000000000
>>
Comments

(i) The Romberg scheme is effective here, since integration is over a smooth
nonperiodic function.

(ii) Same as (i).

(iii) Same as (i).

(iv) Romberg is worse than the trapezoidal rule, because the integrand is a
smooth periodic function with period $\pi$, and integration is over the full
period.

(v) Romberg is only slightly better than the trapezoidal rule, and both converge
slowly. The reason is the singularity at x = 1 (where the derivative is
infinite).

(vi) Romberg does not provide any improvement, since the derivative of f is
discontinuous at an irrational point ($\sqrt{2}$).

(vii) The trapezoidal rule, in contrast to the Romberg scheme, is exact after the
third step, because 3/4 then becomes, and remains, a meshpoint, separating
two linear pieces of f. Romberg, however, eventually catches up.


Chapter 4 
Nonlinear Equations 


The problems discussed in this chapter may be written generically in the form

\[ f(x) = 0, \tag{4.1} \]

but allow different interpretations depending on the meaning of x and f. The
simplest case is a single equation in a single unknown, in which case f is a given
function of a real or complex variable, and we are trying to find values of this
variable for which f vanishes. Such values are called roots of the equation (4.1), or
zeros of the function f. If x in (4.1) is a vector, say, $x = [x_1, x_2, \ldots, x_d]^T \in \mathbb{R}^d$,
and f is also a vector, each component of which is a function of d variables
$x_1, x_2, \ldots, x_d$, then (4.1) represents a system of equations. It is said to be a
nonlinear system if at least one component of f depends nonlinearly on at least
one of the variables $x_1, x_2, \ldots, x_d$. If all components of f are linear functions of
$x_1, x_2, \ldots, x_d$, then we call (4.1) a system of linear algebraic equations, which (if
d > 1) is of considerable interest in itself, but is not discussed in this chapter. Still
more generally, (4.1) could represent a functional equation, if x is an element in
some function space and f a (linear or nonlinear) operator acting on this space. In
each of these interpretations, the zero on the right of (4.1), of course, has a different
meaning: the number zero in the first case, the zero vector in the second, and the
function identically equal to zero in the last case.

Much of this chapter is devoted to single nonlinear equations. Such equations are 
often encountered in the analysis of vibrating systems, where the roots correspond 
to critical frequencies (resonance). The special case of algebraic equations, where 
f in (4.1) is a polynomial, is also of considerable importance and merits special 
treatment. Systems of nonlinear equations are briefly considered at the end of the 
chapter. 
4.1 Examples


4.1.1 A Transcendental Equation 


Nonalgebraic equations are referred to as being "transcendental." An example is

\[ \cos x \cosh x - 1 = 0, \tag{4.2} \]

which is typical for equations arising in problems of resonance. Before one starts
computing roots, it is helpful to gather some qualitative properties about them:
are there any symmetries among the roots? How many roots are there? Where
approximately are they located? With regard to symmetry, one notes immediately
from (4.2) that the roots are located symmetrically with respect to the origin: if $\alpha$ is
a root, so is $-\alpha$. Also, $\alpha = 0$ is a trivial root (which is uninteresting in applications).
It suffices therefore to consider positive roots.

A quick way to get insight into the number and location of roots of (4.2) is to
divide the equation by cos x and to rewrite it in the form

\[ \cosh x = \frac{1}{\cos x}. \tag{4.3} \]

No roots are being lost by this transformation, since clearly $\cos x \ne 0$ at any root
$x = \alpha$. Now one graphs the function on the right and the function on the left and
observes where the two graphs intersect. The respective abscissae of intersection are
the desired (real) roots of (4.2). This is illustrated in Fig. 4.1 (not drawn to scale).
It is evident from this figure that there are infinitely many positive roots. Indeed,
each interval $[(2n - \tfrac12)\pi, (2n + \tfrac12)\pi]$, n = 1,2,3,..., has exactly two roots,
$\alpha_n < \beta_n$, with $\alpha_n$ rapidly approaching the left endpoint, and $\beta_n$ the right endpoint,
as n increases. These account for all positive roots and thus, by symmetry, for all
nonvanishing real roots. In applications, it is likely that only the smallest positive
root, $\alpha_1$, will be of interest.
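
The qualitative picture above can be confirmed numerically. The following is a minimal sketch in Python (not the text's Matlab); the helper names `bisect` and `roots_in_interval` are hypothetical. It uses the fact that the function in (4.2) is negative at both endpoints of the n-th interval and positive at its center $2n\pi$, so that each half of the interval brackets one root.

```python
import math

def f(x):
    return math.cos(x) * math.cosh(x) - 1.0

def bisect(lo, hi, steps=60):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid < 0) == (flo < 0):   # same sign as at lo: root lies in [mid, hi]
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def roots_in_interval(n):
    """Return (alpha_n, beta_n) in [(2n-1/2)pi, (2n+1/2)pi]."""
    left = (2 * n - 0.5) * math.pi
    center = 2 * n * math.pi          # cos = 1 here, so f = cosh(2n*pi) - 1 > 0
    right = (2 * n + 0.5) * math.pi
    return bisect(left, center), bisect(center, right)

for n in (1, 2, 3):
    alpha, beta = roots_in_interval(n)
    left, right = (2 * n - 0.5) * math.pi, (2 * n + 0.5) * math.pi
    # distances to the endpoints shrink rapidly as n grows
    print(n, alpha, beta, alpha - left, right - beta)
```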


Fig. 4.1 Graphical interpretation of (4.3)


4.1.2 A Two-Point Boundary Value Problem


Here we are looking for a function $y \in C^2[0,1]$ satisfying the differential equation

\[ y'' = g(x, y, y'), \quad 0 \le x \le 1, \tag{4.4} \]

and the boundary conditions

\[ y(0) = y_0, \quad y(1) = y_1, \tag{4.5} \]

where g is a given (typically nonlinear) function on $[0,1] \times \mathbb{R} \times \mathbb{R}$, and $y_0$, $y_1$ are
given numbers. At first sight, this does not look like a problem of the form (4.1), but
it can be reduced to it if one introduces the associated initial value problem

\[ u'' = g(x, u, u'), \quad 0 \le x \le 1; \qquad u(0) = y_0, \quad u'(0) = s, \tag{4.6} \]

where s (for "slope") is an unknown to be determined. Suppose, indeed, that for
each s, (4.6) has a unique solution that exists on the whole interval [0,1]. Denote it
by u(x) = u(x; s). Then problem (4.4), (4.5) is equivalent to the problem

\[ u(1; s) - y_1 = 0 \tag{4.7} \]

in the sense that to each solution of (4.7) there corresponds a solution of (4.4), (4.5)
and vice versa (cf. Chap. 7, Sect. 7.1.2). Thus, by defining

\[ f(s) = u(1; s) - y_1, \tag{4.8} \]

we have precisely a problem of the form (4.1). It is to be noted, however, that f(s)
is not given explicitly as a function of s; rather, to evaluate f(s) for any s, one has
to solve the initial value problem (4.6) over the whole interval [0,1] to find the value
of u(x; s) at x = 1, and hence of f(s) in (4.8).

A very natural way to go about solving (4.4), (4.5) is to evaluate f(s) for some 
initial guess s. If f(s) is, say, positive, we lower the value of s until we find one 
for which f(s) is negative. Then we have two slopes s: one that “overshoots” the 
target and one that “undershoots” it. We now take as our next aim the average 
of these slopes and “shoot” again. Depending on whether we hit above the target 
or below, we discard the first or second initial slope and continue to repeat the 
same procedure. In the terminology of boundary value problems, this is called the 
shooting method. To shoot is tantamount to solving an initial value problem for a 
second-order differential equation, which in fact is the equation of the trajectory a 
bullet would traverse if it were fired from a gun. In the terminology of this chapter, 
it is called the bisection method (cf. Sect. 4.3.1). 
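
The shooting procedure just described can be sketched as follows, in Python rather than the text's Matlab. Everything specific here is an assumption made for illustration: the right-hand side $g(x,u,u') = \tfrac32 u^2$, the boundary values $y_0 = 4$, $y_1 = 1$ (for which $y(x) = 4/(1+x)^2$ is a solution with slope $y'(0) = -8$), the classical Runge-Kutta integrator with 200 steps, and the starting slopes -10 and -5, which are assumed to undershoot and overshoot the target, respectively.

```python
def g(x, u, up):
    # hypothetical right-hand side of (4.4), chosen only for illustration
    return 1.5 * u * u

def u_at_1(s, y0=4.0, steps=200):
    """Solve (4.6) on [0,1] by classical RK4 and return u(1; s)."""
    h = 1.0 / steps
    x, u, v = 0.0, y0, s              # v stands for u'
    for _ in range(steps):
        k1u, k1v = v, g(x, u, v)
        k2u = v + 0.5 * h * k1v
        k2v = g(x + 0.5 * h, u + 0.5 * h * k1u, v + 0.5 * h * k1v)
        k3u = v + 0.5 * h * k2v
        k3v = g(x + 0.5 * h, u + 0.5 * h * k2u, v + 0.5 * h * k2v)
        k4u = v + h * k3v
        k4v = g(x + h, u + h * k3u, v + h * k3v)
        u += h * (k1u + 2 * k2u + 2 * k3u + k4u) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
        x += h
    return u

def shoot(s_low, s_high, y1=1.0, iters=40):
    """Bisection on the slope; assumes f(s_low) < 0 < f(s_high) with f as in (4.8)."""
    for _ in range(iters):
        s = 0.5 * (s_low + s_high)
        if u_at_1(s) - y1 < 0:        # undershoots the target: raise the lower slope
            s_low = s
        else:                         # overshoots: lower the upper slope
            s_high = s
    return 0.5 * (s_low + s_high)

s = shoot(-10.0, -5.0)
print(s)    # close to -8, the slope of 4/(1+x)^2 at x = 0
```

Each evaluation of f(s) costs one full integration of the initial value problem, which is why the number of bisection steps matters in practice.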


4.1.3 A Nonlinear Integral Equation 


Suppose we want to find a solution $y \in C[0,1]$ of the integral equation

\[ y(x) - \int_0^1 K(x,t)\, f(t, y(t))\,dt = a(x), \quad 0 \le x \le 1, \tag{4.9} \]

where K, the "kernel" of the equation, is a given (integrable) function on $[0,1] \times [0,1]$,
f a given function on $[0,1] \times \mathbb{R}$, typically nonlinear in the second argument,
and a also a given function on [0,1]. One way to approximately solve (4.9) is to
approximate the kernel by a degenerate kernel,

\[ K(x,t) \approx k_n(x,t), \quad k_n(x,t) = \sum_{i=1}^{n} c_i(t)\,\pi_i(x). \tag{4.10} \]

We may think of the degenerate kernel as coming from truncating (to n terms) an
infinite expansion of K(x,t) in a system of basis functions $\{\pi_i(x)\}$, with coefficients
$c_i$ depending only on t. Replacing K in (4.9) by $k_n$ then yields an approximate
solution $y_n$, which is to satisfy

\[ y_n(x) - \int_0^1 k_n(x,t)\, f(t, y_n(t))\,dt = a(x), \quad 0 \le x \le 1. \tag{4.11} \]

If we substitute (4.10) into (4.11) and define

\[ \alpha_i = \int_0^1 c_i(t)\, f(t, y_n(t))\,dt, \quad i = 1,2,\ldots,n, \tag{4.12} \]

we can write $y_n$ in the form

\[ y_n(x) = a(x) + \sum_{i=1}^{n} \alpha_i \pi_i(x). \tag{4.13} \]

All that remains to be done is to compute the coefficients $\alpha_i$ in this representation.
It is at this point where one is led to a system of nonlinear equations. Indeed, by
(4.12), the $\alpha_i$ must satisfy

\[ \alpha_i - \int_0^1 c_i(t)\, f\Bigl(t,\, a(t) + \sum_{j=1}^{n} \alpha_j \pi_j(t)\Bigr)dt = 0, \quad i = 1,2,\ldots,n, \tag{4.14} \]

where the left-hand sides are functions $f_i$ of $\alpha_1, \alpha_2, \ldots, \alpha_n$ which can be evaluated
by numerical integration. It is seen how techniques of approximation (to obtain
$k_n$) discussed in Chap. 2, and techniques of integration (to compute the integrals
in (4.14)) discussed in Chap. 3, usefully combine to provide an approximate solution
of (4.9).
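
To make the construction concrete, here is a minimal sketch in Python (not the text's Matlab) for the smallest possible case n = 1. All data are hypothetical and chosen so that the answer is known: the kernel K(x,t) = xt is already degenerate with $c_1(t) = t$, $\pi_1(x) = x$; the nonlinearity is $f(t,y) = y^2$; and $a(x) = 3x/4$, so that y(x) = x solves (4.9). The single equation (4.14) is solved by a simple fixed-point iteration, which is an assumption of this sketch and not a method prescribed by the text.

```python
def simpson(func, a, b, m=20):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    total = func(a) + func(b)
    for j in range(1, m):
        total += (4 if j % 2 else 2) * func(a + j * h)
    return total * h / 3

# Hypothetical problem data (see the lead-in above).
c1 = lambda t: t                  # coefficient function c_1(t)
pi1 = lambda x: x                 # basis function pi_1(x)
f = lambda t, y: y * y            # nonlinearity f(t, y)
a = lambda x: 0.75 * x            # right-hand side a(x)

def residual(alpha):
    """Left-hand side of (4.14) for n = 1, evaluated by numerical integration."""
    integrand = lambda t: c1(t) * f(t, a(t) + alpha * pi1(t))
    return alpha - simpson(integrand, 0.0, 1.0)

alpha = 0.0
for _ in range(60):
    alpha = alpha - residual(alpha)   # fixed-point step: alpha <- value of the integral

y_n = lambda x: a(x) + alpha * pi1(x)  # representation (4.13)
print(alpha, y_n(0.5))
```

For this data the integrand is a cubic in t, so Simpson's rule integrates it exactly up to rounding; for general kernels the quadrature rule and the nonlinear solver both have to be chosen with more care.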


4.1.4 s-Orthogonal Polynomials 


In Ex. 20(b) of Chap. 3 we encountered an instance (s = 1) of "power orthogonality,"
that is, a (monic) polynomial $\pi_n$ of degree n satisfying

\[ \int_{\mathbb{R}} [\pi_n(t)]^{2s+1} p(t)\,d\lambda(t) = 0, \quad \text{all } p \in \mathbb{P}_{n-1}. \tag{4.15} \]

This is called an s-orthogonal polynomial relative to the (positive) measure $d\lambda$. We
can reinterpret power orthogonality as ordinary orthogonality

\[ (\pi_n, p)_{d\lambda^s} := \int_{\mathbb{R}} \pi_n(t)\, p(t)\, \pi_n^{2s}(t)\,d\lambda(t) = 0, \]

but relative to the (positive) measure $d\lambda^s(t) = \pi_n^{2s}(t)\,d\lambda(t)$ depending on $\pi_n$. Thus,
orthogonality is defined implicitly. The point, however, is that if we denote by
$\{\pi_{k,n}\}_{k=0}^{n}$ the first n + 1 orthogonal polynomials relative to $d\lambda^s$, we have $\pi_n = \pi_{n,n}$,
and we can formally generate $\pi_{n,n}$ by a three-term recurrence relation:

\[ \pi_{k+1,n}(t) = (t - \alpha_k)\,\pi_{k,n} - \beta_k\,\pi_{k-1,n}, \quad k = 0,1,\ldots,n-1, \tag{4.16} \]

where $\pi_{-1,n}(t) = 0$, $\pi_{0,n}(t) = 1$. The coefficients $\alpha_0, \alpha_1, \ldots, \alpha_{n-1}$; $\beta_0, \beta_1, \ldots, \beta_{n-1}$
are unknown and must be determined. Here is how a system of 2n nonlinear
equations can be constructed for them:

From Chap. 2, (2.40) and (2.41), one has

\[ \alpha_k = \frac{(t\pi_{k,n}, \pi_{k,n})_{d\lambda^s}}{(\pi_{k,n}, \pi_{k,n})_{d\lambda^s}}, \quad k = 0,1,\ldots,n-1, \]

\[ \beta_0 = (1,1)_{d\lambda^s}, \quad \beta_k = \frac{(\pi_{k,n}, \pi_{k,n})_{d\lambda^s}}{(\pi_{k-1,n}, \pi_{k-1,n})_{d\lambda^s}}, \quad k = 1,\ldots,n-1. \]

Consequently, clearing denominators,

\[
\begin{aligned}
f_0 &:= \beta_0 - \int_{\mathbb{R}} \pi_{n,n}^{2s}(t)\,d\lambda(t) = 0, \\
f_{2\nu+1} &:= \int_{\mathbb{R}} (\alpha_\nu - t)\,\pi_{\nu,n}^2(t)\,\pi_{n,n}^{2s}(t)\,d\lambda(t) = 0, \quad \nu = 0,1,\ldots,n-1, \\
f_{2\nu} &:= \int_{\mathbb{R}} [\beta_\nu \pi_{\nu-1,n}^2(t) - \pi_{\nu,n}^2(t)]\,\pi_{n,n}^{2s}(t)\,d\lambda(t) = 0, \quad \nu = 1,\ldots,n-1.
\end{aligned}
\tag{4.17}
\]

Each of the $\pi_{\nu,n}$, $\nu = 1,2,\ldots,n$, depends on $\alpha_0, \ldots, \alpha_{\nu-1}$; $\beta_1, \ldots, \beta_{\nu-1}$ via the
three-term recurrence relation (4.16). Therefore, we have 2n equations depending
nonlinearly on the 2n unknowns $\alpha_0, \ldots, \alpha_{n-1}$; $\beta_0, \ldots, \beta_{n-1}$:

\[ f(\rho) = 0, \quad \rho = [\alpha_0, \ldots, \alpha_{n-1};\ \beta_0, \ldots, \beta_{n-1}]^T. \]

Since the components of f are integrals of polynomials of degree at most
2(s + 1)n - 1, they can be computed exactly by an (s + 1)n-point Gauss quadrature
rule relative to the measure $d\lambda$ (cf. MA 9).


4.2 Iteration, Convergence, and Efficiency 


Even the simplest of nonlinear equations (for example, algebraic equations) are
known to not admit solutions that are expressible rationally in terms of the data. It is
therefore impossible, in general, to compute roots of nonlinear equations in a finite
number of arithmetic operations. What is required is an iterative method, that is, a
procedure that generates an infinite sequence of approximations, $\{x_n\}_{n=0}^{\infty}$, such that

\[ \lim_{n\to\infty} x_n = \alpha \tag{4.18} \]

for some root $\alpha$ of the equation. In case of a system of equations, both $x_n$ and $\alpha$ are
vectors of appropriate dimension, and convergence is to be understood in the sense
of componentwise convergence.

Although convergence of an iterative process is certainly desirable, it takes more 
than just convergence to make it practical. What one wants is fast convergence. A 
basic concept to measure the speed of convergence is the order of convergence. 


Definition 4.2.1. Linear convergence. One says that $x_n$ converges to $\alpha$ (at least)
linearly if

\[ |x_n - \alpha| \le \varepsilon_n, \tag{4.19} \]

where $\{\varepsilon_n\}$ is a positive sequence satisfying

\[ \lim_{n\to\infty} \frac{\varepsilon_{n+1}}{\varepsilon_n} = c, \quad 0 < c < 1. \tag{4.20} \]

If (4.19) and (4.20) hold with the inequality in (4.19) replaced by an equality,
then c is called the asymptotic error constant.


The phrase "at least" in this definition relates to the fact that we have only
inequality in (4.19), which in practice is all we can usually ascertain. So, strictly
speaking, it is the bounds $\varepsilon_n$ that converge linearly, meaning that eventually (e.g.,
for n large enough) each of these error bounds is approximately a constant fraction
of the preceding one.

For linearly convergent sequences there is a simple device, called Aitken's $\Delta^2$-process,
that can be used to speed up convergence. One defines

\[ x_n' = x_n - \frac{(\Delta x_n)^2}{\Delta^2 x_n}, \tag{4.21} \]

where $\Delta x_n = x_{n+1} - x_n$, $\Delta^2 x_n = \Delta(\Delta x_n) = x_{n+2} - 2x_{n+1} + x_n$. The sequence
$\{x_n'\}$ then converges faster than $\{x_n\}$ in the sense that

\[ \frac{x_n' - \alpha}{x_n - \alpha} \to 0 \quad \text{as } n \to \infty, \tag{4.22} \]

where $\alpha = \lim_{n\to\infty} x_n$ (cf. Ex. 6).
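
A small experiment illustrates (4.21) and (4.22). The sketch below is in Python (not the text's Matlab); the test sequence $x_{n+1} = \cos x_n$, $x_0 = 1$, which converges linearly to the fixed point of the cosine, and the helper name `aitken` are assumptions made for illustration.

```python
import math

def aitken(x):
    """Apply Aitken's Delta^2 process (4.21) to the list x."""
    out = []
    for n in range(len(x) - 2):
        d1 = x[n + 1] - x[n]                    # Delta x_n
        d2 = x[n + 2] - 2 * x[n + 1] + x[n]     # Delta^2 x_n
        out.append(x[n] - d1 * d1 / d2)
    return out

# Linearly convergent test sequence x_{n+1} = cos(x_n).
x = [1.0]
for _ in range(12):
    x.append(math.cos(x[-1]))

xp = aitken(x)
alpha = 0.7390851332151607      # fixed point of cos, used only to measure errors
ratios = [abs(xp[n] - alpha) / abs(x[n] - alpha) for n in range(len(xp))]
for n, r in enumerate(ratios):
    print(n, x[n], xp[n], r)    # the ratios in (4.22) decrease toward 0
```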


Definition 4.2.2. Convergence of order p. One says that $x_n$ converges to $\alpha$ with (at
least) order $p \ge 1$ if (4.19) holds with

\[ \lim_{n\to\infty} \frac{\varepsilon_{n+1}}{\varepsilon_n^p} = c, \quad c > 0. \tag{4.23} \]

(If p = 1, one must assume, in addition, c < 1.)


Thus, convergence of order 1 is the same as linear convergence, whereas
convergence of order p > 1 is faster. Note that in this latter case there is no
restriction on the constant c: once $\varepsilon_n$ is small enough, it will be the exponent p
that takes care of convergence. The constant c is again referred to as the asymptotic
error constant if we have equality in (4.19).

The same definitions apply also to vector-valued sequences; one only needs to
replace absolute values in (4.19) by (any) vector norm.

The classification of convergence with respect to order is still rather crude, as
there are types of convergence that "fall between the cracks." Thus, a sequence $\{\varepsilon_n\}$
may converge to 0 more slowly than linearly, for example, such that c = 1 in (4.20).
We may call this type of convergence sublinear. Likewise, c = 0 in (4.20) gives rise
to superlinear convergence, if (4.23) does not hold for any p > 1 (cf. also Ex. 4).

It is instructive to examine the behavior of $\varepsilon_n$ if instead of the limit relations
(4.20) and (4.23) we had strict equality from some n on, say,

\[ \frac{\varepsilon_{n+1}}{\varepsilon_n^p} = c, \quad n = n_0, n_0+1, n_0+2, \ldots. \tag{4.24} \]

For $n_0$ large enough, this is almost true. A simple induction argument then shows
that

\[ \varepsilon_{n_0+k} = c^{\frac{p^k-1}{p-1}}\, \varepsilon_{n_0}^{p^k}, \quad k = 0,1,2,\ldots, \tag{4.25} \]

which certainly holds for p > 1, but also for p = 1 in the limit as $p \downarrow 1$:

\[ \varepsilon_{n_0+k} = c^k\, \varepsilon_{n_0}, \quad k = 0,1,2,\ldots \ (p = 1). \tag{4.26} \]

Assuming then $\varepsilon_{n_0}$ sufficiently small so that the approximation $x_{n_0}$ has several
correct decimal digits, we write $\varepsilon_{n_0+k} = 10^{-\delta_k}\varepsilon_{n_0}$. Then $\delta_k$, according to (4.19),
approximately represents the number of additional correct decimal digits in the
approximation $x_{n_0+k}$ (as compared to $x_{n_0}$). Taking logarithms in (4.26) and (4.25)
gives

\[
\delta_k =
\begin{cases}
k \log\dfrac{1}{c} & \text{if } p = 1, \\[2ex]
p^k \left[\dfrac{1-p^{-k}}{p-1}\log\dfrac{1}{c} + (1-p^{-k})\log\dfrac{1}{\varepsilon_{n_0}}\right] & \text{if } p > 1;
\end{cases}
\]

hence, as $k \to \infty$,

\[ \delta_k \sim C_1 k \ \ (p = 1), \qquad \delta_k \sim C_p\, p^k \ \ (p > 1), \tag{4.27} \]

where $C_1 = \log\frac{1}{c} > 0$ if p = 1 and $C_p = \frac{1}{p-1}\log\frac{1}{c} + \log\frac{1}{\varepsilon_{n_0}}$. (We assume here
that $n_0$ is large enough, and hence $\varepsilon_{n_0}$ small enough, to have $C_p > 0$.) This shows
that the number of correct decimal digits increases linearly with k, when p = 1, but
exponentially when p > 1. In the latter case, $\delta_{k+1}/\delta_k \sim p$, meaning that ultimately
(for large k) the number of correct decimal digits increases, per iteration step, by a
factor of p.
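
The growth of the number of correct digits can be tabulated directly from the model (4.24). The sketch below is in Python (not the text's Matlab); the values c = 0.5 and $\varepsilon_{n_0} = 10^{-1}$ and the helper name `digits_gained` are assumptions made for illustration. It works with base-10 logarithms of the error bounds to avoid underflow.

```python
import math

def digits_gained(p, c, eps0, kmax):
    """Return [delta_0, ..., delta_kmax] when eps_{n+1} = c*eps_n**p holds exactly."""
    log_start = math.log10(eps0)
    log_eps = log_start
    deltas = []
    for _ in range(kmax + 1):
        deltas.append(log_start - log_eps)          # delta_k = log10(eps_{n0}/eps_{n0+k})
        log_eps = math.log10(c) + p * log_eps       # one step of (4.24) in log form
    return deltas

linear = digits_gained(1, 0.5, 1e-1, 10)
quadratic = digits_gained(2, 0.5, 1e-1, 10)
for k in range(1, 10):
    # linear case: constant gain per step; quadratic case: ratio tends to p = 2
    print(k, linear[k], quadratic[k], quadratic[k + 1] / quadratic[k])
```

With these values the linear case gains about 0.3 digits per step, while in the quadratic case the number of gained digits roughly doubles from one step to the next, in line with (4.27).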
If each iteration step requires m units of work (a "unit of work" typically is the
work involved in computing a function value or a value of one of its derivatives),
then the efficiency index of the iteration may be defined by
$\lim_{k\to\infty} [\delta_{k+1}/\delta_k]^{1/m} = p^{1/m}$.
It provides a common basis on which to compare different iterative methods
with one another (cf. Ex. 7). Methods that converge linearly have efficiency index 1.

Practical computation requires the employment of a stopping rule that terminates
the iteration once the desired accuracy is (or is believed to be) attained. Ideally, one
stops as soon as the absolute value (or norm) of the error $x_n - \alpha$ is smaller than a
prescribed error tolerance. Since $\alpha$ is not known, one commonly replaces $x_n - \alpha$ by
$x_n - x_{n-1}$ and requires

\[ \| x_n - x_{n-1} \| \le \mathrm{tol}, \tag{4.28} \]

where

\[ \mathrm{tol} = \|x_n\|\,\epsilon_r + \epsilon_a \tag{4.29} \]

with $\epsilon_r$, $\epsilon_a$ prescribed tolerances. As a safety measure, one might require (4.28) not
just for one but for a few consecutive values of n. Choosing $\epsilon_r = 0$ or $\epsilon_a = 0$ will
make (4.29) an absolute resp. relative error tolerance. It is prudent, however, to use
a "mixed error tolerance," say, $\epsilon_r = \epsilon_a = \epsilon$. Then, if $\|x_n\|$ is small or moderately
large, one effectively controls the absolute error, whereas for $\|x_n\|$ very large, it is
in effect the relative error that is controlled.
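
The stopping rule (4.28), (4.29), including the safety measure of requiring it for several consecutive steps, might be coded as follows. This is a sketch in Python (not the text's Matlab) for a scalar fixed-point iteration; the function name `iterate`, its parameters, and the test iteration $x_n = \cos x_{n-1}$ are assumptions made for illustration.

```python
import math

def iterate(phi, x0, eps_r=1e-12, eps_a=1e-12, max_iter=200, repeat=1):
    """Iterate x_n = phi(x_{n-1}) until (4.28) holds `repeat` times in a row."""
    x_old, hits = x0, 0
    for n in range(1, max_iter + 1):
        x_new = phi(x_old)
        tol = abs(x_new) * eps_r + eps_a            # mixed tolerance (4.29)
        hits = hits + 1 if abs(x_new - x_old) <= tol else 0
        if hits >= repeat:
            return x_new, n
        x_old = x_new
    raise RuntimeError("stopping rule not satisfied within max_iter steps")

root, n = iterate(math.cos, 1.0, eps_r=1e-10, eps_a=1e-10, repeat=2)
print(root, n)
```

Note that the rule controls the difference of successive iterates, not the error itself; for slowly (e.g., sublinearly) converging sequences the two can differ substantially.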
Both these methods generate a sequence of nested intervals $[a_n, b_n]$, n =
0,1,2,..., whereby each interval is guaranteed to contain at least one root of
the equation. As $n \to \infty$, the length of these intervals tends to 0, so that in the limit
exactly one (isolated) root is captured. The first method applies to any equation (4.1)
with f a continuous function, but has no built-in control of steering the iteration to
any particular (real) root if there is more than one. The second method does have
such a control mechanism, but applies only to a restricted class of equations, for
example, the characteristic equation of a symmetric tridiagonal matrix.


4.3.1 Bisection Method 


We assume that two numbers a, b with a < b are known such that

\[ f \in C[a,b], \quad f(a) < 0, \quad f(b) > 0. \tag{4.30} \]

What is essential here is that f has opposite signs at the endpoints of [a,b]; the
particular sign combination in (4.30) is not essential as it can always be obtained,
if necessary, by multiplying f by -1. Assumptions (4.30), in particular, guarantee
that f has at least one zero in (a,b).

By repeatedly bisecting the interval and discarding endpoints in such a manner
that the sign property in (4.30) remains preserved, it is possible to generate a
sequence of nested intervals whose lengths are successively halved and each of
which contains a zero of f.

Specifically, the procedure is as follows. Define $a_1 = a$, $b_1 = b$. Then

    for n = 1,2,3,... do
        $x_n = \tfrac12 (a_n + b_n)$
        if $f(x_n) < 0$ then $a_{n+1} = x_n$, $b_{n+1} = b_n$ else
        $a_{n+1} = a_n$, $b_{n+1} = x_n$.

Since $b_n - a_n = 2^{-(n-1)}(b - a)$, n = 1,2,3,..., and $x_n$ is the midpoint of
$[a_n, b_n]$, if $\alpha$ is the root eventually captured, we have

\[ |x_n - \alpha| \le \tfrac12 (b_n - a_n) = \frac{b-a}{2^n}. \tag{4.31} \]

Thus, (4.19) holds with $\varepsilon_n = 2^{-n}(b - a)$ and

\[ \frac{\varepsilon_{n+1}}{\varepsilon_n} = \frac12. \tag{4.32} \]

This shows that the bisection method converges (at least) linearly with asymptotic
error constant (for the bound $\varepsilon_n$) equal to $\tfrac12$.

Given an (absolute) error tolerance tol > 0, the error in (4.31) will be less than
or equal to tol if

\[ \frac{b-a}{2^n} \le \mathrm{tol}. \]

Solved explicitly for n, this will be satisfied if

\[ n = \left\lceil \frac{\log\frac{b-a}{\mathrm{tol}}}{\log 2} \right\rceil, \tag{4.33} \]

where $\lceil x \rceil$ denotes the "ceiling" of x (i.e., the smallest integer $\ge x$). Thus, we know
a priori how many steps are necessary to achieve a prescribed accuracy.
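
The procedure and the a priori step count (4.33) translate into a few lines of code. The following is a double-precision sketch in Python (not the text's Matlab, and without variable-precision arithmetic); the function name `bisection` is hypothetical, and the example applies it to Eq. (4.2) on $[\tfrac32\pi, 2\pi]$, where f is negative at the left and positive at the right endpoint.

```python
import math

def bisection(f, a, b, tol):
    """Bisection with the step count from (4.33); assumes f(a) < 0 < f(b)."""
    n_steps = math.ceil(math.log((b - a) / tol) / math.log(2))
    x = a
    for _ in range(n_steps):
        x = 0.5 * (a + b)       # midpoint x_n of the current interval
        if f(x) < 0:
            a = x               # root lies in [x_n, b_n]
        else:
            b = x               # root lies in [a_n, x_n]
    return n_steps, x

f = lambda x: math.cos(x) * math.cosh(x) - 1.0
n, x = bisection(f, 1.5 * math.pi, 2 * math.pi, 0.5e-7)
print(n, x)
```

Only the current endpoints and midpoint are kept, so no arrays are needed; the step count depends on the interval length and the tolerance alone, not on f.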
It should be noted that when $f(x_n)$ approaches the level of machine precision, its
sign may be computed incorrectly, hence the wrong half of the current interval may
be chosen. Normally, this should be of little concern since by this time, the interval
has already become sufficiently small. In this sense, bisection is a method that is
robust at the level of machine precision. Problems may occur when f is very flat
near the root $\alpha$, in which case, however, the root is numerically not well determined
(cf. MA 1).

When implementing the procedure on a computer, it is clearly unnecessary to
provide arrays to store $a_n$, $b_n$, $x_n$; one simply keeps overwriting. Assuming a and b
have been initialized and tol assigned, one could use the following Matlab program
(in symbolic/variable-precision mode to allow for high accuracies):


%SBISEC Symbolic bisection method
%
function [ntol,x]=sbisec(dig,a,b,tol)
ntol=ceil(log((b-a)/tol)/log(2));
digits(dig);
a=vpa(a); b=vpa(b);
for n=1:ntol
  x=vpa((a+b)/2);
  fx=subs(f(x));
  if fx<0
    a=x;
  else
    b=x;
  end
end

function y=f(x)
y=cos(x)*cosh(x)-1;


As an example, we run the program to compute the smallest positive root of (4.2),
that is, of $f(x) = 0$ with $f(x) = \cos x \cosh x - 1$ (see the subfunction appended
to sbisec.m). By taking $a = \frac{3}{2}\pi$, $b = 2\pi$ (cf. Fig. 4.1), we enclose exactly
one root, $\alpha_1$, from the start, and bisection is guaranteed to converge to $\alpha_1$. This
is implemented in the program TABLE4_3_1.m for tolerances tol = $0.5 \times 10^{-7}$,
$0.5 \times 10^{-15}$, and $0.5 \times 10^{-33}$; the results, rounded to the appropriate number of digits,
are shown in Table 4.1.¹

%TABLE4_3_1  Bisection method to solve Eq. (4.2)
a=3*pi/2; b=2*pi; dig=35;
for tol=[.5e-7 .5e-15 .5e-33]
  [n,x]=sbisec(dig,a,b,tol)
end

Table 4.1 The bisection method applied to (4.2)
  n     $\alpha_1$
  25    4.7300408
  52    4.730040744862704
  112   4.730040744862704026024048100833885

As can be seen, bisection requires a fair amount of work (over 100 iterations)
to obtain $\alpha_1$ to high precision. Had we run the program with
initial interval $[a,b]$, $a = \frac{3}{2}\pi$, $b = 4\pi$, which contains three roots, we would have
converged to the third root, $\alpha_3$, with about the same amount of work.
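The same procedure can be sketched in ordinary double-precision arithmetic. The Python fragment below is only an illustration under assumptions made here, not the routine used for Table 4.1: the names `f` and `bisect` are hypothetical, and without variable precision only the first tolerance of the table is within reach.

```python
import math

def f(x):
    # Test function of Eq. (4.2): f(x) = cos x cosh x - 1
    return math.cos(x) * math.cosh(x) - 1.0

def bisect(f, a, b, tol):
    """Bisection on [a, b], assuming f(a) < 0 < f(b).

    Returns (ntol, x): the a priori number of steps and the last midpoint.
    """
    ntol = math.ceil(math.log((b - a) / tol) / math.log(2.0))
    x = a
    for _ in range(ntol):
        x = 0.5 * (a + b)
        if f(x) < 0.0:
            a = x          # root lies in the right half
        else:
            b = x          # root lies in the left half
    return ntol, x

ntol, x = bisect(f, 1.5 * math.pi, 2.0 * math.pi, 0.5e-7)
print(ntol, x)
```

For tol = 0.5e-7 the a priori step count evaluates to 25, the first entry of Table 4.1.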


4.3.2 Method of Sturm Sequences 


There are situations in which it is desirable to be able to select one particular 
root among many and have the iterative scheme converge to it. This is the case, 
for example, in orthogonal polynomials, where we know that all zeros are real 
and distinct (cf. Chap. 3, Sect. 3.2.3(a)). It may well be that we are interested in
the second-largest or third-largest zero and should be able to compute it without
computing any of the others. This is indeed possible if we combine bisection with
the theorem of Sturm.²

Thus, consider


$$ f(x) = \pi_d(x), \tag{4.34} $$

where $\pi_d$ is a polynomial of degree $d$ orthogonal with respect to some positive
measure. We know (cf. Chap. 3, Sect. 3.2.3(e)) that $\pi_d$ is the characteristic polynomial
of a symmetric tridiagonal matrix and can be computed recursively by a three-term
recurrence relation

$$ \begin{aligned} \pi_0(x) &= 1, \quad \pi_1(x) = x - \alpha_0, \\ \pi_{k+1}(x) &= (x - \alpha_k)\pi_k(x) - \beta_k \pi_{k-1}(x), \quad k = 1, 2, \ldots, d-1, \end{aligned} \tag{4.35} $$

with all $\beta_k$ positive. Recursion (4.35) not only is useful to compute $\pi_d(x)$ for any
fixed $x$, but also has the following interesting property due to Sturm: Let $\sigma(x)$ be
the number of sign changes (zeros do not count) in the sequence of numbers

$$ \pi_d(x), \pi_{d-1}(x), \ldots, \pi_1(x), \pi_0(x). \tag{4.36} $$

Then, for any two numbers $a, b$ with $a < b$, the number of real zeros of $\pi_d$ in the
interval $a < x \le b$ is equal to $\sigma(a) - \sigma(b)$.

Since $\pi_k(x) = x^k + \cdots$, it is clear that $\sigma(-\infty) = d$, $\sigma(+\infty) = 0$, so that
indeed the number of real zeros of $\pi_d$ is $\sigma(-\infty) - \sigma(+\infty) = d$. Moreover, if
$\xi_1 > \xi_2 > \cdots > \xi_d$ denote the zeros of $\pi_d$ in decreasing order, we have the behavior
of $\sigma(x)$ as shown in Fig. 4.2.

Fig. 4.2 Sturm’s theorem (values $d, d-1, d-2, \ldots$ of $\sigma(x)$ between consecutive zeros of $\pi_d$)

¹ The 33-digit result given in the first edition of this text is erroneous, being accurate only to 19
digits after the decimal point.

² Jacques Charles François Sturm (1803-1855), a Swiss analyst and theoretical physicist of Alsatian
parentage, is best known for his theorem on Sturm sequences, discovered in 1829, and his theory
of Sturm-Liouville differential equations, published in 1834, which earned him the Grand Prix des
Sciences Mathématiques. He also contributed significantly to differential and projective geometry.
A member of the French Academy of Sciences since 1836, he succeeded Poisson in the chair of
mechanics at the École Polytechnique in Paris in 1839.

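It is now easy to see that

$$ \sigma(x) \le r - 1 \quad \text{iff} \quad x \ge \xi_r. \tag{4.37} $$

Indeed, suppose that $x \ge \xi_r$. Then $\#\{\text{zeros} \le x\} \ge d + 1 - r$; hence, by Sturm’s
theorem, $\sigma(-\infty) - \sigma(x) = d - \sigma(x) = \#\{\text{zeros} \le x\} \ge d + 1 - r$; that is,
$\sigma(x) \le r - 1$. Conversely, if $\sigma(x) \le r - 1$, then, again by Sturm’s theorem,
$\#\{\text{zeros} \le x\} = d - \sigma(x) \ge d + 1 - r$, which implies $x \ge \xi_r$ (cf. Fig. 4.2).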

The basic idea now is to control the bisection process not, as before, by checking
the sign of $\pi_d(x)$, but rather, by checking the inequality (4.37) to see whether we
are on the right or left side of the zero $\xi_r$. To initialize the procedure, we need
two values $a_1 = a$, $b_1 = b$ such that $a < \xi_d$ and $b > \xi_1$. These are trivially
obtained as the endpoints of the interval of orthogonality for $\pi_d$, if it is finite. More
generally, one can apply Gershgorin’s theorem³ to the Jacobi matrix $J_d$ associated
with (4.35) (cf. Chap. 3, Sect. 3.2.3(e)) by recalling that the zeros of $\pi_d$ are precisely
the eigenvalues of $J_d$. In this way, $a$ can be chosen to be the smallest and $b$ the
largest of the numbers $\alpha_0 \pm \sqrt{\beta_1}$, $\alpha_1 \pm (\sqrt{\beta_1} + \sqrt{\beta_2})$, $\ldots$,
$\alpha_{d-2} \pm (\sqrt{\beta_{d-2}} + \sqrt{\beta_{d-1}})$, $\alpha_{d-1} \pm \sqrt{\beta_{d-1}}$. The method of Sturm sequences then proceeds as follows,
for any given $r$ with $1 \le r \le d$:

for n = 1,2,3,... do
  $x_n = \frac{1}{2}(a_n + b_n)$
  if $\sigma(x_n) > r - 1$ then $a_{n+1} = x_n$, $b_{n+1} = b_n$ else
  $a_{n+1} = a_n$, $b_{n+1} = x_n$.

³ Gershgorin’s theorem states that the eigenvalues of a matrix $A = [a_{ij}]$ of order $d$ are located in
the union of the disks $\{z \in \mathbb{C}: |z - a_{ii}| \le r_i\}$, $i = 1, 2, \ldots, d$, where $r_i = \sum_{j \ne i} |a_{ij}|$.

Since initially $\sigma(a) = d > r - 1$, $\sigma(b) = 0 \le r - 1$, it follows by construction that

$$ \sigma(a_n) > r - 1, \quad \sigma(b_n) \le r - 1, \quad \text{all } n = 1, 2, 3, \ldots, \tag{4.38} $$

meaning that $\xi_r \in [a_n, b_n]$ for all $n = 1, 2, 3, \ldots$. Moreover, as in the bisection
method, $b_n - a_n = 2^{-(n-1)}(b - a)$, so that $|x_n - \xi_r| \le \varepsilon_n$ with $\varepsilon_n = 2^{-n}(b - a)$.
The method converges (at least) linearly to the root $\xi_r$. A computer implementation
can be modeled after the one for the bisection method by modifying the if-else
statement appropriately.
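A minimal Python sketch of the method of Sturm sequences follows, under assumptions made here purely for illustration: the recurrence coefficients are those of the monic Legendre polynomials ($\alpha_k = 0$, $\beta_k = k^2/(4k^2 - 1)$), the interval of orthogonality $[-1, 1]$ supplies $a$ and $b$, and the helper names `sigma` and `sturm_zero` are hypothetical.

```python
import math

def sigma(x, alpha, beta):
    """Sign changes in pi_0(x), ..., pi_d(x) of (4.35); zeros are skipped.

    alpha[k], k = 0..d-1, and beta[k], k = 1..d-1 (beta[0] is unused).
    """
    d = len(alpha)
    p_prev, p = 1.0, x - alpha[0]
    seq = [p_prev, p]
    for k in range(1, d):
        p_prev, p = p, (x - alpha[k]) * p - beta[k] * p_prev
        seq.append(p)
    signs = [v > 0.0 for v in seq if v != 0.0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def sturm_zero(r, alpha, beta, a, b, nsteps):
    """Bisection controlled by (4.37); approximates the r-th largest zero."""
    for _ in range(nsteps):
        x = 0.5 * (a + b)
        if sigma(x, alpha, beta) > r - 1:
            a = x          # x lies to the left of xi_r
        else:
            b = x          # x >= xi_r
    return 0.5 * (a + b)

d = 5
alpha = [0.0] * d
beta = [0.0] + [k * k / (4.0 * k * k - 1.0) for k in range(1, d)]
xi2 = sturm_zero(2, alpha, beta, -1.0, 1.0, 50)
print(xi2)   # second-largest zero of the Legendre polynomial of degree 5
```

Only the test on `sigma` distinguishes this loop from plain bisection; none of the other zeros is ever computed.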


4.4 Method of False Position 


As in the method of bisection, we assume two numbers $a < b$ such that

$$ f \in C[a,b], \quad f(a) f(b) < 0 \tag{4.39} $$

and generate a sequence of nested intervals $[a_n, b_n]$, $n = 1, 2, 3, \ldots$, with $a_1 = a$,
$b_1 = b$, such that $f(a_n) f(b_n) < 0$. Unlike the bisection method, however, we
are not taking the midpoint of $[a_n, b_n]$ to determine the next interval, but rather the
solution $x = x_n$ of the linear equation

$$ p_1(f; a_n, b_n; x) = 0, \tag{4.40} $$

where $p_1(f; a_n, b_n; \cdot)$ is the linear interpolant of $f$ at $a_n$ and $b_n$. This would appear
to be more flexible than bisection, as $x_n$ will come to lie closer to the endpoint at
which $|f|$ is smaller. Also, if $f$ is a linear function, we obtain the root in one step
rather than in an infinite number of steps. This explains the somewhat strange name
given to this method (cf. Notes to Section 4.4).

More explicitly, the method proceeds as follows: define $a_1 = a$, $b_1 = b$. Then,

for n = 1,2,3,... do
  $x_n = a_n - \dfrac{a_n - b_n}{f(a_n) - f(b_n)} f(a_n)$
  if $f(x_n) f(a_n) > 0$ then $a_{n+1} = x_n$, $b_{n+1} = b_n$ else
  $a_{n+1} = a_n$, $b_{n+1} = x_n$.

One may terminate the iteration as soon as $b_n - a_n <$ tol or $|f(x_n)| <$ tol, where
tol is a prescribed error tolerance.

As in the bisection method, when implementing the method on a computer, the a
and b can be overwritten. On the other hand, it is no longer known a priori how many
iterations it takes to achieve the desired accuracy. It is prudent, then, to put a limit
on the number of iterations. An implementation in (symbolic/variable-precision)
Matlab may look as follows.


%SFALSEPOS  Symbolic method of false position
%
function [n,x]=sfalsepos(dig,a,b,tol,nmax)
n=0;
digits(dig);
a=vpa(a); b=vpa(b);
fa=f(a); fb=f(b);
while subs(b-a)>=tol
  n=n+1;
  if n>nmax
    fprintf('n exceeds nmax')
    return
  end
  x=a-(a-b)*fa/(fa-fb);
  fx=f(x);
  if abs(subs(fx))<tol, return; end
  if subs(fx*fa)>0
    a=x; fa=fx;
  else
    b=x; fb=fx;
  end
end
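For readers without the symbolic toolbox, the same loop can be sketched in double-precision Python. This is an illustrative transcription under assumptions made here (the names `f` and `false_position` are hypothetical, and an exception replaces the printed message), applied to the test function $f(x) = \cos x \cosh x - 1$ of (4.2).

```python
import math

def f(x):
    return math.cos(x) * math.cosh(x) - 1.0

def false_position(f, a, b, tol, nmax):
    """Method of false position on [a, b], assuming f(a) f(b) < 0.

    Returns (n, x): number of iterations and the last approximation.
    """
    fa, fb = f(a), f(b)
    n, x = 0, a
    while b - a >= tol:
        n += 1
        if n > nmax:
            raise RuntimeError("n exceeds nmax")
        x = a - (a - b) * fa / (fa - fb)   # zero of the linear interpolant
        fx = f(x)
        if abs(fx) < tol:
            break
        if fx * fa > 0.0:
            a, fa = x, fx                  # sign change is in [x, b]
        else:
            b, fb = x, fx                  # sign change is in [a, x]
    return n, x

n, x = false_position(f, 1.5 * math.pi, 2.0 * math.pi, 0.5e-7, 200)
print(n, x)
```

The iteration count returned here can be compared with the bisection count for the same tolerance.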


The convergence behavior is most easily analyzed if we assume that $f$ is convex
or concave on $[a, b]$. To fix ideas, suppose $f$ is convex, say,

$$ f''(x) > 0 \ \text{ for } a < x < b \quad \text{and} \quad f(a) < 0, \ f(b) > 0. \tag{4.41} $$

Then $f$ has exactly one zero, $\alpha$, in $[a,b]$. Moreover, the secant connecting $f(a)$
and $f(b)$ lies entirely above the graph of $y = f(x)$ and hence intersects the
real line to the left of $\alpha$. This will be the case for all subsequent secants, which
means that the point $x = b$ remains fixed while the other endpoint $a$ gets continuously
updated, producing a monotonically increasing sequence of approximations
defined by

$$ x_{n+1} = x_n - \frac{x_n - b}{f(x_n) - f(b)} f(x_n), \quad n = 1, 2, 3, \ldots, \tag{4.42} $$

where $x_1 = a$. Any such sequence, being bounded from above by $\alpha$, necessarily
converges, and letting $n \to \infty$ in (4.42) shows immediately that $f(x_n) \to 0$, that
is, $x_n \to \alpha$. To determine the speed of convergence, we subtract $\alpha$ from both sides
of (4.42) and use the fact that $f(\alpha) = 0$:

$$ x_{n+1} - \alpha = x_n - \alpha - \frac{x_n - b}{f(x_n) - f(b)} [f(x_n) - f(\alpha)]. $$

Now divide by $x_n - \alpha$ to get

$$ \frac{x_{n+1} - \alpha}{x_n - \alpha} = 1 - \frac{x_n - b}{f(x_n) - f(b)} \cdot \frac{f(x_n) - f(\alpha)}{x_n - \alpha}. $$

Letting here $n \to \infty$ and using the fact that $x_n \to \alpha$, we obtain

$$ \lim_{n \to \infty} \frac{x_{n+1} - \alpha}{x_n - \alpha} = 1 - (b - \alpha)\frac{f'(\alpha)}{f(b)}. \tag{4.43} $$

Thus, we have linear convergence with asymptotic error constant equal to

$$ c = 1 - (b - \alpha)\frac{f'(\alpha)}{f(b)}. \tag{4.44} $$

It is clear on geometric grounds that $0 < c < 1$ under the assumptions made.
An analogous result will hold, with a constant $|c| < 1$, provided $f$ is either convex
or concave on $[a, b]$ and has opposite signs at the endpoints $a$ and $b$. One of these
endpoints then remains fixed while the other moves monotonically to the root $\alpha$.

If $f$ does not satisfy these convexity properties on the whole interval but is such
that $f \in C^2[a,b]$ and $f''(\alpha) \ne 0$ at the root $\alpha$ eventually approached, then the
convergence behavior described sets in for $n$ large enough, since $f''$ has constant
sign in a neighborhood of $\alpha$, and $x_n$ will eventually come to lie in this neighborhood.

The fact that one of the “false positions” remains at a fixed point from some $n$
on may speak against this method, especially if this occurs early in the iteration, or
even from the beginning, as under the assumptions (4.41) made previously. Thus,
in the case of (4.2), for example, when $a = \frac{3}{2}\pi$ and $b = 2\pi$, we have
$f''(x) = -2 \sin x \sinh x > 0$ on $[a,b]$, and $f(a) = -1$, $f(b) = \cosh(2\pi) - 1 > 0$, so that
we are precisely in a case where (4.41) holds. Accordingly, we have found that to
compute the root $\alpha_1$ to within a tolerance of $0.5 \times 10^{-7}$, $0.5 \times 10^{-15}$, and $0.5 \times 10^{-33}$,
we now need, respectively, 42, 87, and 188 iterations, as compared to 25, 52, and 112
in the case of the bisection method.

Exceptionally slow convergence is likely to occur when $f$ is very flat near $\alpha$,
the point $a$ is nearby, and $b$ further away. In this case, $b$ will typically remain fixed
while $a$ is slowly creeping toward $\alpha$ (cf. MA 1).


4.5 Secant Method


The secant method is a simple variant of the method of false position in which it is
no longer required that the function $f$ has opposite signs at the end points of each
interval generated, not even the initial interval. In other words, one starts with two
arbitrary initial approximations $x_0 \ne x_1$ and continues with

$$ x_{n+1} = x_n - \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} f(x_n), \quad n = 1, 2, 3, \ldots. \tag{4.45} $$

This precludes the formation of a fixed false position, as in the method of false
position, and hence suggests potentially faster convergence. On the other hand, we
can no longer be sure that each interval $[x_{n-1}, x_n]$ contains at least one root. It will
turn out that, if the method converges, it does so with an order of convergence larger
than 1 (but less than 2). However, it converges only “locally,” that is, only if the
initial approximations $x_0$, $x_1$ are sufficiently close to a root.

This can be seen by relating the three consecutive errors of $x_{n+1}$, $x_n$, and $x_{n-1}$ as
follows. Subtract $\alpha$ on both sides of (4.45), and use $f(\alpha) = 0$, to get

$$ \begin{aligned} x_{n+1} - \alpha &= x_n - \alpha - \frac{f(x_n)}{[x_{n-1}, x_n]f} \\ &= (x_n - \alpha)\left(1 - \frac{f(x_n) - f(\alpha)}{(x_n - \alpha)[x_{n-1}, x_n]f}\right) \\ &= (x_n - \alpha)\left(1 - \frac{[x_n, \alpha]f}{[x_{n-1}, x_n]f}\right) \\ &= (x_n - \alpha)\frac{[x_{n-1}, x_n]f - [x_n, \alpha]f}{[x_{n-1}, x_n]f}; \end{aligned} $$

hence, by the definition of divided differences,

$$ x_{n+1} - \alpha = (x_n - \alpha)(x_{n-1} - \alpha)\frac{[x_{n-1}, x_n, \alpha]f}{[x_{n-1}, x_n]f}, \quad n = 1, 2, 3, \ldots. \tag{4.46} $$

This is the fundamental relation holding between three consecutive errors.
From (4.46) it follows immediately that if $\alpha$ is a simple root,

$$ f(\alpha) = 0, \quad f'(\alpha) \ne 0, \tag{4.47} $$

and if $x_n \to \alpha$, then convergence is faster than linear, at least if $f \in C^2$ near $\alpha$.
Indeed,

$$ \lim_{n \to \infty} \frac{x_{n+1} - \alpha}{x_n - \alpha} = 0, $$

since the divided differences in the numerator and denominator of (4.46) converge
to $\frac{1}{2} f''(\alpha)$ and $f'(\alpha)$, respectively. But just how fast is convergence?

We can discover the order of convergence (assuming convergence) by a simple
heuristic argument: we replace the ratio of divided differences in (4.46) by a
constant, which is almost true if $n$ is large. Letting then $e_n = |x_n - \alpha|$, we have

$$ e_{n+1} = e_n e_{n-1} \cdot C, \quad C > 0. $$

Multiplying both sides by $C$ and defining $E_n = C e_n$ gives

$$ E_{n+1} = E_n E_{n-1}, \quad E_n \to 0. $$

Taking logarithms on both sides, and defining $y_n = \log \frac{1}{E_n}$, we get

$$ y_{n+1} = y_n + y_{n-1}, \tag{4.48} $$

the well-known difference equation for the Fibonacci sequence. Its characteristic
equation is $t^2 - t - 1 = 0$, which has the two roots $t_1$, $t_2$ with

$$ t_1 = \tfrac{1}{2}(1 + \sqrt{5}), \quad t_2 = \tfrac{1}{2}(1 - \sqrt{5}), $$

and $t_1 > 1$, $|t_2| < 1$. The general solution of (4.48), therefore, is

$$ y_n = c_1 t_1^n + c_2 t_2^n, \quad c_1, c_2 \text{ constant}. $$

Since $y_n \to \infty$, we have $c_1 \ne 0$ and $y_n \sim c_1 t_1^n$ as $n \to \infty$, which translates into
$\frac{1}{e_n} \sim C e^{c_1 t_1^n}$, and thus

$$ \frac{e_{n+1}}{e_n^{t_1}} \sim \frac{C^{t_1} e^{c_1 t_1^n \cdot t_1}}{C e^{c_1 t_1^{n+1}}} = C^{t_1 - 1}, \quad n \to \infty. $$

The order of convergence, therefore, is $t_1 = \frac{1}{2}(1 + \sqrt{5}) = 1.61803\ldots$ (the golden
ratio).
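The heuristic can be checked numerically. The Python sketch below rests on choices made here for illustration only: the function $f(x) = x^2 - 2$ with $x_0 = 1$, $x_1 = 2$ is not the text's example, and Python's `decimal` module stands in for variable-precision arithmetic. With $y_n = -\log e_n$, the ratios of successive differences $y_{n+1} - y_n$ should approach the golden ratio.

```python
from decimal import Decimal, getcontext

getcontext().prec = 120          # working precision in decimal digits

def f(x):
    return x * x - 2             # illustrative function with root sqrt(2)

alpha = Decimal(2).sqrt()
x_prev, x = Decimal(1), Decimal(2)
errors = [abs(x_prev - alpha), abs(x - alpha)]
for _ in range(9):               # secant steps (4.45), producing x_2, ..., x_10
    x_prev, x = x, x - (x - x_prev) / (f(x) - f(x_prev)) * f(x)
    errors.append(abs(x - alpha))

y = [-(e.ln()) for e in errors]                       # y_n = log(1/e_n)
d = [y[k + 1] - y[k] for k in range(len(y) - 1)]      # successive differences
order_estimates = [float(d[k + 1] / d[k]) for k in range(len(d) - 1)]
print(order_estimates[-3:])
```

The differences $y_{n+1} - y_n$ satisfy the Fibonacci recurrence (4.48) up to a vanishing perturbation, which is why their ratios settle quickly.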
We now give a rigorous proof of this and begin with a proof of local convergence. 


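Theorem 4.5.1. Let $\alpha$ be a simple zero of $f$. Let $I_\varepsilon = \{x \in \mathbb{R}: |x - \alpha| \le \varepsilon\}$ and
assume $f \in C^2[I_\varepsilon]$. Define, for sufficiently small $\varepsilon$,

$$ M(\varepsilon) = \max_{s \in I_\varepsilon,\ t \in I_\varepsilon} \left| \frac{f''(s)}{2 f'(t)} \right|. \tag{4.49} $$

Assume $\varepsilon$ so small that

$$ \varepsilon M(\varepsilon) < 1. \tag{4.50} $$

Then the secant method converges to the unique root $\alpha \in I_\varepsilon$ for any starting values
$x_0 \ne x_1$ with $x_0 \in I_\varepsilon$, $x_1 \in I_\varepsilon$.

Note that $\lim_{\varepsilon \to 0} M(\varepsilon) = \left| \frac{f''(\alpha)}{2 f'(\alpha)} \right| < \infty$, so that (4.50) can certainly be satisfied
for $\varepsilon$ small enough. The local nature of convergence is thus quantified by the
requirement $x_0, x_1 \in I_\varepsilon$.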


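Proof of Theorem 4.5.1. First of all, observe that $\alpha$ is the only zero of $f$ in $I_\varepsilon$. This
follows from Taylor’s formula applied at $x = \alpha$:

$$ f(x) = f(\alpha) + (x - \alpha) f'(\alpha) + \frac{(x - \alpha)^2}{2} f''(\xi), $$

where $f(\alpha) = 0$ and $\xi$ is between $x$ and $\alpha$. Thus, if $x \in I_\varepsilon$, then so is $\xi$, and we
have

$$ f(x) = (x - \alpha) f'(\alpha) \left\{ 1 + \frac{x - \alpha}{2} \frac{f''(\xi)}{f'(\alpha)} \right\}. $$

Here, if $x \ne \alpha$, all three factors are different from zero, the last one since by
assumption

$$ \left| \frac{x - \alpha}{2} \frac{f''(\xi)}{f'(\alpha)} \right| \le \varepsilon M(\varepsilon) < 1. $$

Thus, $f$ on $I_\varepsilon$ can only vanish at $x = \alpha$.

Next we show that all $x_n \in I_\varepsilon$ and two consecutive iterates are distinct, unless
$f(x_n) = 0$ for some $n$, in which case $x_n = \alpha$ and the method converges in a finite
number of steps. We prove this by induction: assume that $x_{n-1} \in I_\varepsilon$, $x_n \in I_\varepsilon$ for
some $n$ and $x_n \ne x_{n-1}$. (By assumption, this is true for $n = 1$.) Then, from known
properties of divided differences, and by our assumption that $f \in C^2[I_\varepsilon]$, we have

$$ [x_{n-1}, x_n]f = f'(\xi_1), \quad [x_{n-1}, x_n, \alpha]f = \tfrac{1}{2} f''(\xi_2), \quad \xi_i \in I_\varepsilon, \ i = 1, 2. $$

Therefore, by (4.46),

$$ |x_{n+1} - \alpha| \le \varepsilon^2 \left| \frac{f''(\xi_2)}{2 f'(\xi_1)} \right| \le \varepsilon \cdot \varepsilon M(\varepsilon) < \varepsilon, $$

showing that $x_{n+1} \in I_\varepsilon$. Furthermore, by (4.45), $x_{n+1} \ne x_n$, unless $f(x_n) = 0$,
hence $x_n = \alpha$.

Finally, again using (4.46), we have

$$ |x_{n+1} - \alpha| \le |x_n - \alpha| \, \varepsilon M(\varepsilon), \quad n = 1, 2, 3, \ldots, $$

which, applied repeatedly, yields

$$ |x_n - \alpha| \le [\varepsilon M(\varepsilon)]^{n-1} |x_1 - \alpha|. $$

Since $\varepsilon M(\varepsilon) < 1$, it follows that $x_n \to \alpha$ as $n \to \infty$. □

We next prove that the order of convergence is indeed what we derived it to be
heuristically.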


Theorem 4.5.2. The secant method is locally convergent with order of convergence
(at least) $p = \frac{1}{2}(1 + \sqrt{5}) = 1.61803\ldots$.


Proof. Local convergence is the content of Theorem 4.5.1 and so needs no further
proof. We assume that $x_0, x_1 \in I_\varepsilon$, where $\varepsilon$ satisfies (4.50), and that all $x_n$ are
distinct. Then we know that $x_n \ne \alpha$ for all $n$, and $x_n \to \alpha$ as $n \to \infty$.

Now the number $p$ in the theorem satisfies

$$ p^2 = p + 1. \tag{4.51} $$

From (4.46), we have

$$ |x_{n+1} - \alpha| \le |x_n - \alpha| \, |x_{n-1} - \alpha| \cdot M, \tag{4.52} $$

where we write simply $M$ for $M(\varepsilon)$. Define

$$ E_n = M |x_n - \alpha|. \tag{4.53} $$

Then, multiplying (4.52) by $M$, we get

$$ E_{n+1} \le E_n E_{n-1}. $$

It follows easily by induction that

$$ E_n \le E^{p^n}, \quad E = \max(E_0, E_1^{1/p}). \tag{4.54} $$

Indeed, this is trivially true for $n = 0$ and $n = 1$. Suppose (4.54) holds for $n$ as well
as for $n - 1$. Then

$$ E_{n+1} \le E_n E_{n-1} \le E^{p^n} E^{p^{n-1}} = E^{p^{n-1}(p+1)} = E^{p^{n-1} p^2} = E^{p^{n+1}}, $$

where (4.51) has been used. This proves (4.54) for $n + 1$, and hence for all $n$. It now
follows from (4.53) that

$$ |x_n - \alpha| \le \varepsilon_n, \quad \varepsilon_n = \frac{1}{M} E^{p^n}. $$

Since $E_0 = M |x_0 - \alpha| \le \varepsilon M(\varepsilon) < 1$, and the same holds for $E_1$, we have $E < 1$.
Now it suffices to note that

$$ \frac{\varepsilon_{n+1}}{\varepsilon_n^p} = M^{p-1} \frac{E^{p^{n+1}}}{E^{p^n \cdot p}} = M^{p-1} \quad \text{for all } n, $$

to establish the theorem. □


The method is easily programmed for a computer. In (symbolic/variable- 
precision) Matlab, for example, we could use the following program: 


%SSECANT  Symbolic secant method
%
function [n,x]=ssecant(dig,a,b,tol,nmax)
n=0;
digits(dig);
a=vpa(a); b=vpa(b);
fa=f(a);
x=b;
while subs(abs(x-a))>=tol
  n=n+1;
  if n>nmax
    fprintf('n exceeds nmax')
    return
  end
  b=a;
  fb=fa;
  a=x;
  fa=f(x);
  x=a-(a-b)*fa/(fa-fb);  % reuse fa, so f is evaluated once per step
end


It is assumed here that a = $x_0$, b = $x_1$ and that the iteration (4.45) is terminated
as soon as $|x_{n+1} - x_n| <$ tol or $n >$ nmax. The routine produces not only the
approximation x to the root but also the number n of iterations required to obtain it
to within an error of tol.

Since only one evaluation of $f$ is required in each iteration step (the statement
fa=f(x) in the preceding program), the secant method has efficiency index
$p = 1.61803\ldots$ (cf. Sect. 4.2).
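A double-precision Python sketch of the same loop is given below as an illustration only; the names `f` and `secant` are chosen here, an exception replaces the printed message, and the starting values $x_0 = \frac{3}{2}\pi$, $x_1 = 2\pi$ for the test function of (4.2) are one possible choice.

```python
import math

def f(x):
    return math.cos(x) * math.cosh(x) - 1.0

def secant(f, x0, x1, tol, nmax):
    """Secant iteration (4.45), stopped when |x_{n+1} - x_n| < tol.

    Returns (n, x): number of iterations and the last approximation.
    """
    a, fa = x0, f(x0)
    x = x1
    n = 0
    while abs(x - a) >= tol:
        n += 1
        if n > nmax:
            raise RuntimeError("n exceeds nmax")
        b, fb = a, fa
        a, fa = x, f(x)                    # the only new evaluation of f
        x = a - (a - b) * fa / (fa - fb)
    return n, x

n, x = secant(f, 1.5 * math.pi, 2.0 * math.pi, 0.5e-7, 50)
print(n, x)
```

Each pass through the loop evaluates `f` exactly once, which is the property behind the efficiency index quoted above.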

To illustrate the considerable gain in speed attainable by the secant method, we
again apply the routine to the equation (4.2) with $x_0 = \frac{3}{2}\pi$, $x_1 = 2\pi$. In view
of the convexity of $f$, both $x_2$ and $x_3$ come to lie to the left of the root $\alpha = \alpha_1$.
Let $\varepsilon = \alpha - \frac{3}{2}\pi$ and $I_\varepsilon$ be the interval defined in Theorem 4.5.1. Its right endpoint
is $2\alpha - \frac{3}{2}\pi$, which is $< \frac{7}{4}\pi$. Indeed, $\frac{7}{4}\pi - (2\alpha - \frac{3}{2}\pi) = 2(\frac{7}{8}\pi - \alpha + \frac{3}{4}\pi) = 2(\frac{13}{8}\pi - \alpha) > 0$,
since $\alpha < \frac{13}{8}\pi$ on account of $f(\frac{13}{8}\pi) = 30.545\ldots > 0$. On the
interval $[\frac{3}{2}\pi, \frac{7}{4}\pi]$, and hence also on $I_\varepsilon$, we have

$$ \left| \frac{f''(s)}{2 f'(t)} \right| \le \frac{2 \sinh(\frac{7}{4}\pi)}{2 \cosh(\frac{3}{2}\pi)} = 1.0965\ldots, $$

which, multiplied by $\frac{1}{4}\pi$, the length of the interval, is $0.86121\ldots < 1$. All the
more, therefore, $\varepsilon M(\varepsilon) < 1$. By Theorem 4.5.1 and its proof, it follows that all
subsequent approximations $x_3, x_4, \ldots$ remain in this interval and converge to $\alpha_1$,
the order of convergence being $p = \frac{1}{2}(1 + \sqrt{5})$ by Theorem 4.5.2. To obtain
the root $\alpha_1$ to within the three tolerances used earlier, we find that 6, 7, and 9
iterations, respectively, suffice. This should be contrasted with 25, 52, and 112
iterations (cf. Sect. 4.3.1) for the bisection method.

In contrast to bisection, the secant method is not robust at the level of machine
precision and may even fail if fa happens to become equal to fb. The method is,
therefore, rarely used on its own, but more often in combination with the method of
bisection; see, for example, Dekker [1969] and Brent [2002, Chap. 4].


4.6 Newton’s Method 


Newton’s method can be thought of as a limit case of the secant method (4.45) if we
let there $x_{n-1}$ move into $x_n$. The result is

$$ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \quad n = 0, 1, 2, \ldots, \tag{4.55} $$

where $x_0$ is some appropriate initial approximation. Another more fruitful
interpretation is that of linearization of the equation $f(x) = 0$ at $x = x_n$. In
other words, we replace $f(x)$ for $x$ near $x_n$ by the linear approximation
$f(x) \approx f_a(x) = f(x_n) + (x - x_n) f'(x_n)$ obtained by truncating the Taylor expansion of
$f$ centered at $x_n$ after the linear term and then solve the resulting linear equation,
$f_a(x) = 0$, calling the solution $x_{n+1}$. This again leads to (4.55). Viewed in this
manner, Newton’s method can be vastly generalized to nonlinear equations of all
kinds, not only single equations as in (4.55), but also systems of nonlinear equations
(cf. Sect. 4.9.2) and even functional equations, in which case the derivative $f'$ is to
be understood as a Fréchet derivative.

It is useful to begin with a few simple examples of single equations to get a feel
for how Newton’s method may behave.

Fig. 4.3 A cycle in Newton’s method


Example. The square root $\alpha = \sqrt{a}$, $a > 0$. The equation here is $f(x) = 0$ with

$$ f(x) = x^2 - a. \tag{4.56} $$

Equation (4.55) then immediately gives

$$ x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right), \quad n = 0, 1, 2, \ldots \tag{4.57} $$

(a method already used by the Babylonians long before Newton). Because of the
convexity of $f$ it is clear that iteration (4.57) converges to the positive square root
for each $x_0 > 0$ and is monotonically decreasing (except for the first step in the case
$0 < x_0 < \alpha$). We have here an elementary example of global convergence.
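The monotone behavior is easy to observe numerically. The following Python sketch is illustrative only: the values $a = 2$ and $x_0 = 0.5$ (a start below $\sqrt{a}$) and the name `newton_sqrt` are assumptions made here.

```python
import math

def newton_sqrt(a, x0, nsteps):
    """Iterates x_0, x_1, ..., x_nsteps of (4.57) for the square root of a."""
    xs = [x0]
    for _ in range(nsteps):
        x = xs[-1]
        xs.append(0.5 * (x + a / x))
    return xs

xs = newton_sqrt(2.0, 0.5, 8)
for n, x in enumerate(xs):
    print(n, x, x - math.sqrt(2.0))
```

Since $x_0 < \sqrt{a}$ here, the first step increases the iterate; from $x_1$ on the sequence decreases toward $\sqrt{2}$ until rounding errors take over.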


Example. $f(x) = \sin x$, $|x| < \frac{1}{2}\pi$. There is exactly one root in this interval, the
trivial root $\alpha = 0$. Newton’s method becomes

$$ x_{n+1} = x_n - \tan x_n, \quad n = 0, 1, 2, \ldots. \tag{4.58} $$

It exhibits the following amusing phenomenon (cf. Fig. 4.3). If $x_0 = x^*$, where

$$ \tan x^* = 2x^*, \tag{4.59} $$

then $x_1 = -x^*$, $x_2 = x^*$, that is, after two steps of Newton’s method we end up
where we started. This is called a cycle.

For this starting value, Newton’s method does not converge, let alone to $\alpha = 0$.
It does converge, however, for any starting value $x_0$ with $|x_0| < x^*$, generating
a sequence of alternately increasing and decreasing approximations $x_n$ converging
necessarily to $\alpha = 0$. The value of the critical number $x^*$ can itself be computed by
Newton’s method applied to (4.59). The result is $x^* = 1.16556\ldots$. In a sense, we
have here an example of local convergence, since convergence cannot hold for all
$x_0 \in [-\frac{1}{2}\pi, \frac{1}{2}\pi]$. (If $x_0 = \frac{1}{2}\pi$, we even get thrown off to $\infty$.)


Example. $f(x) = x^{20} - 1$, $x > 0$. This has exactly one positive (simple) root $\alpha = 1$.
Newton’s method yields the iteration

$$ x_{n+1} = \frac{19}{20} x_n + \frac{1}{20 x_n^{19}}, \quad n = 0, 1, 2, \ldots, \tag{4.60} $$

which provides a good example to illustrate one of the dangers in Newton’s method:
unless one starts sufficiently close to the desired root, it may take a long while to
approach it. Thus, suppose we take $x_0 = \frac{1}{2}$; then $x_1 \approx \frac{2^{19}}{20} = 2.62144 \times 10^4$, a huge
number. What is worse, it is going to be a slow ride back to the vicinity of $\alpha = 1$,
since for $x_n$ very large, one has

$$ x_{n+1} \approx \frac{19}{20} x_n, \quad x_n \gg 1. $$

At each step the approximation is reduced only by a fraction $\frac{19}{20} = 0.95$. It takes
about 200 steps to get back to near the desired root. But once we come close to
$\alpha = 1$, the iteration speeds up dramatically and converges to the root quadratically
(see Theorem 4.6.1). Since $f$ is again convex, we actually have global convergence
on $\mathbb{R}_+$, but, as we have seen, this is of little comfort.
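Both phenomena of the last two examples can be reproduced in a few lines. The Python sketch below is an illustration under assumptions made here: the generic helper `newton`, the starting value 1.2 for solving (4.59), and the stopping tolerances are all choices of this sketch, and the arithmetic is ordinary double precision.

```python
import math

def newton(f, fd, x0, tol, nmax):
    """Newton's method (4.55), stopped when |x_{n+1} - x_n| < tol."""
    x_old, x = x0 + 1.0, x0
    n = 0
    while abs(x - x_old) >= tol:
        n += 1
        if n > nmax:
            raise RuntimeError("n exceeds nmax")
        x_old = x
        x = x_old - f(x_old) / fd(x_old)
    return n, x

# (a) Critical value x* of (4.59): solve tan x - 2x = 0, derivative tan^2 x - 1.
_, x_star = newton(lambda x: math.tan(x) - 2.0 * x,
                   lambda x: math.tan(x) ** 2 - 1.0,
                   1.2, 1e-14, 50)
x1 = x_star - math.tan(x_star)     # one step of (4.58) started at x*
x2 = x1 - math.tan(x1)             # a second step returns (nearly) to x*
print(x_star, x1, x2)

# (b) Slow approach for f(x) = x^20 - 1 started at x0 = 1/2.
n_slow, x_slow = newton(lambda x: x ** 20 - 1.0,
                        lambda x: 20.0 * x ** 19,
                        0.5, 1e-10, 1000)
print(n_slow, x_slow)
```

In floating-point arithmetic the cycle in (a) is only approximate, since rounding errors are amplified at each step; part (b) counts the steps of the long return to the root.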


Example. Let $f \in C^2[a,b]$ be such that

  $f$ is convex (or concave) on $[a, b]$;
  the tangents at the endpoints of $[a, b]$ intersect the real line within $[a, b]$.    (4.61)

In this case, it is clear on geometric grounds that Newton’s method converges
globally, that is, for any $x_0 \in [a,b]$. Note that the tangent condition in (4.61) is
automatically satisfied at one of the endpoints.


The following is a (symbolic/variable-precision) Matlab routine implementing
Newton’s method and returning not only an approximation x to the root, but also
the number n of iterations required to obtain it. The initial approximation is input
by a, and tol is the error tolerance. For reasons of safety, we limit the number of
iterations to nmax. Two function routines f, fd evaluating $f$ and $f'$ have to be
appended as subfunctions.


%SNEWTON  Symbolic Newton's method
%
function [n,x]=snewton(dig,a,tol,nmax)
n=0;
digits(dig);
a=vpa(a);
x0=a+1; x=a;
while subs(abs(x-x0))>=tol
  n=n+1;
  if n>nmax
    fprintf('n exceeds nmax')
    return
  end
  x0=x;
  x=x0-f(x0)/fd(x0);
end

function y=f(x)
y=cos(x)*cosh(x)-1;

function y=fd(x)
y=-sin(x)*cosh(x)+cos(x)*sinh(x);

Our test equation (4.2) provides a good illustration for (4.61). The function
$f(x) = \cos x \cosh x - 1$ is convex on $[\frac{3}{2}\pi, 2\pi]$ and the tangents at both endpoints
intersect the real line inside the interval $[\frac{3}{2}\pi, 2\pi]$. This is obvious, by convexity, for
the right endpoint, and for the left endpoint the tangent intersects the real line at
$x = \frac{3}{2}\pi + \frac{1}{\cosh(3\pi/2)} = 4.7303\ldots < 2\pi$. Moreover, since the point of intersection
of the tangent at the right endpoint, $x = 2\pi - \frac{\cosh(2\pi) - 1}{\sinh(2\pi)} = 5.2869\ldots$, is to
the right of the former, Newton’s method converges faster if started at the left endpoint and
yields $\alpha_1$ in 4, 5, and 6 iterations for the three tolerances $0.5 \times 10^{-7}$, $0.5 \times 10^{-15}$,
and $0.5 \times 10^{-33}$, respectively. This is slightly, but not much, faster than the secant
method. We see shortly, however, that the efficiency index is smaller for Newton’s
method than for the secant method.
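The comparison between the two endpoints can be tried out in double precision. The Python sketch below is an illustration only (the function names are chosen here, and only the tolerance $0.5 \times 10^{-7}$ is meaningful without variable precision); it mirrors the stopping rule of the Matlab routine above.

```python
import math

def f(x):
    return math.cos(x) * math.cosh(x) - 1.0

def fd(x):
    return -math.sin(x) * math.cosh(x) + math.cos(x) * math.sinh(x)

def newton(f, fd, a, tol, nmax):
    """Newton's method started at a; returns (n, x)."""
    x0, x = a + 1.0, a
    n = 0
    while abs(x - x0) >= tol:
        n += 1
        if n > nmax:
            raise RuntimeError("n exceeds nmax")
        x0 = x
        x = x0 - f(x0) / fd(x0)
    return n, x

n_left, x_left = newton(f, fd, 1.5 * math.pi, 0.5e-7, 50)
n_right, x_right = newton(f, fd, 2.0 * math.pi, 0.5e-7, 50)
print(n_left, x_left)
print(n_right, x_right)
```

The first Newton step from the left endpoint lands at the tangent intersection $4.7303\ldots$ quoted above, already very close to the root, which explains the smaller iteration count.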

To study the error in Newton’s iteration, subtract $\alpha$, a presumed simple root of
the equation, from both sides of (4.55) to get

$$ \begin{aligned} x_{n+1} - \alpha &= x_n - \alpha - \frac{f(x_n)}{f'(x_n)} \\ &= (x_n - \alpha)\left(1 - \frac{f(x_n) - f(\alpha)}{(x_n - \alpha) f'(x_n)}\right) \\ &= (x_n - \alpha)\left(1 - \frac{[x_n, \alpha]f}{[x_n, x_n]f}\right) \\ &= (x_n - \alpha)^2 \frac{[x_n, x_n, \alpha]f}{[x_n, x_n]f}. \end{aligned} \tag{4.62} $$

Therefore, if $x_n \to \alpha$, then

$$ \lim_{n \to \infty} \frac{x_{n+1} - \alpha}{(x_n - \alpha)^2} = \frac{f''(\alpha)}{2 f'(\alpha)}, \tag{4.63} $$

that is, Newton’s method converges quadratically if $f''(\alpha) \ne 0$. Since it requires at
each step one function evaluation and one derivative evaluation, the efficiency index
is $\sqrt{2} = 1.41421\ldots$, which is less than the one for the secant method. Formula
(4.63) suggests writing the equation $f(x) = 0$ in alternative, but equivalent, ways
so as to reduce the asymptotic error constant on the right of (4.63).

The proof of local convergence of Newton’s method is virtually the same as the
one for the secant method (cf. Theorem 4.5.1). We only state the result and refer to
Ex. 28 for a proof.


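Theorem 4.6.1. Let $\alpha$ be a simple root of the equation $f(x) = 0$ and let
$I_\varepsilon = \{x \in \mathbb{R}: |x - \alpha| \le \varepsilon\}$. Assume that $f \in C^2[I_\varepsilon]$. Define

$$ M(\varepsilon) = \max_{s \in I_\varepsilon,\ t \in I_\varepsilon} \left| \frac{f''(s)}{2 f'(t)} \right|. \tag{4.64} $$

If $\varepsilon$ is so small that

$$ 2 \varepsilon M(\varepsilon) < 1, \tag{4.65} $$

then for every $x_0 \in I_\varepsilon$ Newton’s method is well defined and converges quadratically
to the only root $\alpha \in I_\varepsilon$. (The extra factor 2 in (4.65) comes from the requirement
that $f'(x) \ne 0$ for $x \in I_\varepsilon$.)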


An interesting variant of Newton’s method is Steffensen’s method

$$ x_{n+1} = x_n - \frac{f(x_n)}{g(x_n)}, \quad g(x_n) = \frac{f(x_n + f(x_n)) - f(x_n)}{f(x_n)}, \tag{4.66} $$

which shares with Newton’s method the property of second-order convergence but
does not require the derivative of $f$. Instead, it replaces $f'(x_n)$ in (4.55) by the
difference quotient $g(x_n) = [f(x_n + h_n) - f(x_n)]/h_n$, where $h_n = f(x_n)$.
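A short Python sketch of iteration (4.66) follows. It is an illustration under assumptions made here: the test function $f(x) = x^2 - 2$, the start $x_0 = 1.5$, the stopping rule, and the guards against a vanishing $f(x_n)$ or $g(x_n)$ are not taken from the text.

```python
def steffensen(f, x0, tol, nmax):
    """Steffensen's method (4.66); returns (n, x)."""
    x = x0
    for n in range(1, nmax + 1):
        fx = f(x)
        if fx == 0.0:
            return n - 1, x                # x is already a zero of f
        g = (f(x + fx) - fx) / fx          # difference quotient with h_n = f(x_n)
        if g == 0.0:
            return n - 1, x                # quotient degenerated; stop
        x_new = x - fx / g
        if abs(x_new - x) < tol:
            return n, x_new
        x = x_new
    raise RuntimeError("n exceeds nmax")

n, x = steffensen(lambda x: x * x - 2.0, 1.5, 1e-12, 50)
print(n, x)
```

Each step costs two evaluations of `f` and none of the derivative.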


4.7 Fixed Point Iteration 


Often, in applications, a nonlinear equation presents itself in the form of a fixed
point problem: find $x$ such that

$$ x = \varphi(x). \tag{4.67} $$

A number $\alpha$ satisfying this equation is called a fixed point of $\varphi$. Any equation
$f(x) = 0$, in fact, can (in many different ways) be written equivalently in the form
(4.67). For example, if $f'(x) \ne 0$ in the interval of interest, we can take

$$ \varphi(x) = x - \frac{f(x)}{f'(x)}. \tag{4.68} $$

If $x_0$ is an initial approximation of a fixed point $\alpha$ of (4.67), the fixed point
iteration generates a sequence of approximants by

$$ x_{n+1} = \varphi(x_n), \quad n = 0, 1, 2, \ldots. \tag{4.69} $$

If it converges, it clearly converges to a fixed point of $\varphi$ if $\varphi$ is continuous. Note
that (4.69) is precisely Newton’s method for solving $f(x) = 0$ if $\varphi$ is defined by
(4.68). So Newton’s method can be viewed as a fixed point iteration, but not the
secant method (why not?).

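For any iteration of the form (4.69), assuming that $x_n \to \alpha$ as $n \to \infty$, it is
straightforward to determine the order of convergence. Suppose, indeed, that at the
fixed point $\alpha$ we have

$$ \varphi'(\alpha) = \varphi''(\alpha) = \cdots = \varphi^{(p-1)}(\alpha) = 0, \quad \varphi^{(p)}(\alpha) \ne 0. \tag{4.70} $$

(We tacitly assume $\varphi \in C^p$ near $\alpha$.) This defines the integer $p \ge 1$. We then have
by Taylor’s theorem

$$ \begin{aligned} \varphi(x_n) &= \varphi(\alpha) + (x_n - \alpha)\varphi'(\alpha) + \cdots + \frac{(x_n - \alpha)^{p-1}}{(p-1)!} \varphi^{(p-1)}(\alpha) + \frac{(x_n - \alpha)^p}{p!} \varphi^{(p)}(\xi_n) \\ &= \varphi(\alpha) + \frac{(x_n - \alpha)^p}{p!} \varphi^{(p)}(\xi_n), \end{aligned} $$

where $\xi_n$ is between $\alpha$ and $x_n$. Since $\varphi(x_n) = x_{n+1}$ and $\varphi(\alpha) = \alpha$, we get

$$ \frac{x_{n+1} - \alpha}{(x_n - \alpha)^p} = \frac{1}{p!} \varphi^{(p)}(\xi_n). $$

As $x_n \to \alpha$, since $\xi_n$ is trapped between $x_n$ and $\alpha$, we conclude, by the continuity
of $\varphi^{(p)}$ at $\alpha$, that

$$ \lim_{n \to \infty} \frac{x_{n+1} - \alpha}{(x_n - \alpha)^p} = \frac{1}{p!} \varphi^{(p)}(\alpha) \ne 0. \tag{4.71} $$

This shows that convergence is exactly of the order $p$, and

$$ c = \frac{1}{p!} \varphi^{(p)}(\alpha) \tag{4.72} $$

is the asymptotic error constant. Combining this with the usual local convergence
argument, we obtain the following result.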


Theorem 4.7.1. Let $\alpha$ be a fixed point of $\varphi$ and $I_\varepsilon = \{x \in \mathbb{R}: |x - \alpha| \le \varepsilon\}$. Assume
$\varphi \in C^p[I_\varepsilon]$ satisfies (4.70). If

$$ M(\varepsilon) := \max_{t \in I_\varepsilon} |\varphi'(t)| < 1, \tag{4.73} $$

then the fixed point iteration (4.69) converges to $\alpha$ for any $x_0 \in I_\varepsilon$. The order of
convergence is $p$ and the asymptotic error constant given by (4.72).
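The statement about the asymptotic error constant can be observed on a simple case. In the Python sketch below, the choice $\varphi(x) = \cos x$ with $x_0 = 1$ is an assumption made here for illustration: its fixed point satisfies $\varphi'(\alpha) = -\sin\alpha \ne 0$, so $p = 1$ and, by (4.72), $c = -\sin\alpha$. The reference value of $\alpha$ is simply taken from a long run of the same iteration.

```python
import math

def fixed_point(phi, x0, nsteps):
    """Iterates x_0, ..., x_nsteps of the fixed point iteration (4.69)."""
    xs = [x0]
    for _ in range(nsteps):
        xs.append(phi(xs[-1]))
    return xs

xs = fixed_point(math.cos, 1.0, 200)
alpha = xs[-1]                       # converged to working precision
ratios = [(xs[n + 1] - alpha) / (xs[n] - alpha) for n in range(20, 30)]
print(alpha, -math.sin(alpha))
print(ratios)
```

The printed ratios alternate slightly around $-\sin\alpha$ and approach it as the errors shrink.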


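Applying the theorem to (4.68) recovers second-order convergence of Newton’s
method. Indeed, with $\varphi$ given in (4.68), we have

$$ \varphi'(x) = 1 - \frac{[f'(x)]^2 - f(x) f''(x)}{[f'(x)]^2} = f(x) \frac{f''(x)}{[f'(x)]^2}; $$

hence $\varphi'(\alpha) = 0$ (if $f'(\alpha) \ne 0$), and

$$ \varphi''(x) = f(x) \left( \frac{f''(x)}{[f'(x)]^2} \right)' + \frac{f''(x)}{f'(x)}; $$

hence $\varphi''(\alpha) = \frac{f''(\alpha)}{f'(\alpha)} \ne 0$, unless $f''(\alpha) = 0$. In the exceptional case $f''(\alpha) = 0$,
Newton’s method converges cubically (at least).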


4.8 Algebraic Equations 


There are many iterative methods specifically designed to solve algebraic equations. 
Here we only describe how Newton’s method applies in this context, essentially 
confining ourselves to a discussion of an efficient way to evaluate simultaneously 
the value of a polynomial and its first derivative. In the special case where all zeros 
of the polynomial are known to be real and simple, we describe an improved variant 
of Newton’s method. 


4.8.1 Newton’s Method Applied to an Algebraic Equation 


We consider an algebraic equation of degree $d$,

$$ f(x) = 0, \quad f(x) = x^d + a_{d-1} x^{d-1} + \cdots + a_0, \tag{4.74} $$

where the leading coefficient is assumed (without restricting generality) to be 1 and
where we may also assume $a_0 \ne 0$ without loss of generality. For simplicity we
assume all coefficients to be real.

To apply Newton’s method to (4.74), one needs good methods for evaluating a
polynomial and its derivative. Underlying such methods are division algorithms for
polynomials. Let $t$ be some parameter and suppose we want to divide $f(x)$ by $x - t$.
We write

$$ f(x) = (x - t)(x^{d-1} + b_{d-1} x^{d-2} + \cdots + b_1) + b_0 \tag{4.75} $$

and compare coefficients of powers of $x$ on both sides. This leads immediately to
the equations

$$ b_d = 1, \quad b_k = t b_{k+1} + a_k, \quad k = d-1, d-2, \ldots, 0. \tag{4.76} $$

The coefficients $b_k$ so determined of course depend on $t$; indeed, $b_k$ is a polynomial
in $t$ of degree $d - k$. We now make three useful observations:


(a) From (4.75), we note that b) = f(t). Thus, in (4.76), we have an algorithm 
to evaluate f(t) for any given t. It is known as Horner’s scheme, although 
Newton already knew it. It requires d multiplications and d additions, which 
is more efficient than the naive way of computing f, which would be to first 
form the successive powers of ¢ and then multiply them into their coefficients. 
This would require twice as many multiplies as Horner’s scheme, but the same 
number of additions. It is an interesting question of complexity whether the 
number of multiplications can be further reduced. (It is known by a theorem of 
Ostrowski that the number d of additions is optimal.) This indeed is possible, 
and schemes using less than d multiplications have been developed by Pan 
and others. Unfortunately, the reduction in complexity comes with a price of 
increased numerical instability. Horner’s scheme, therefore, is still the most 
widely used technique for evaluating a polynomial. 

Suppose t = a, where a is a zero of f. Then bo = 0, and (4.75) allows division 
by x — a without remainder: 


(b 


we 


fx) 


x= 


xo) 4 by xt tet by = (4.77) 


This is the deflated polynomial, in which the zero a has been “removed” from f/f. 
To compute its coefficients, therefore, all we need to do is apply Horner’s 
scheme with tf = a. This comes in very handy in Newton’s method when 
f is evaluated by Horner’s scheme: once the method has converged to a root 
a, the final evaluation of f at (or very near) a@ automatically provides us 
with the coefficients of the deflated polynomial, and we are ready to reapply 
Newton’s method to this deflated polynomial to compute the remaining zeros 
of f. 


(c) By differentiating (4.75) with respect to $x$ and then putting $x = t$, we obtain

$$f'(t) = t^{d-1} + b_{d-1} t^{d-2} + \cdots + b_1. \qquad (4.78)$$

Thus, we can apply Horner’s scheme again to this polynomial to evaluate
$f'(t)$. Both applications of Horner’s scheme are conveniently combined into the
following double Horner scheme:

$$b_d = 1, \qquad c_d = 1,$$
$$b_k = t\,b_{k+1} + a_k, \qquad c_k = t\,c_{k+1} + b_k, \qquad k = d-1, d-2, \ldots, 1, 0. \qquad (4.79)$$

Then

$$f(t) = b_0, \qquad f'(t) = c_1. \qquad (4.80)$$

The last step in (4.79) for $c_0$ is actually redundant, but it does not seem worth the
complication in programming to eliminate this extra step.
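
The double Horner scheme (4.79), (4.80) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the text: the function name `double_horner` and the storage convention (a list `[a_0, ..., a_{d-1}]` holding the non-leading coefficients of the monic polynomial (4.74)) are choices made here. The sketch skips the redundant step for $c_0$.

```python
def double_horner(a, t):
    """Evaluate a monic polynomial and its derivative at t via (4.79).

    a = [a_0, a_1, ..., a_{d-1}] are the non-leading coefficients of
    f(x) = x^d + a_{d-1} x^{d-1} + ... + a_0 (hypothetical storage order).
    Returns (f(t), f'(t), b) with b = [b_0, b_1, ..., b_d].
    """
    d = len(a)
    b = [0.0] * d + [1.0]      # b_d = 1
    c = 1.0                    # c_d = 1
    for k in range(d - 1, -1, -1):
        b[k] = t * b[k + 1] + a[k]
        if k >= 1:             # the step for c_0 is redundant and skipped
            c = t * c + b[k]
    return b[0], c, b          # f(t) = b_0, f'(t) = c_1
```

For $f(x) = (x-1)(x-2)(x-3) = x^3 - 6x^2 + 11x - 6$, the call `double_horner([-6.0, 11.0, -6.0], 3.0)` returns $b_0 = 0$ together with $b_1 = 2$, $b_2 = -3$, the coefficients of the deflated polynomial $x^2 - 3x + 2$ in (4.77).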


We now have a convenient way to compute $f(x_n)$ and $f'(x_n)$ in each Newton
step $x_{n+1} = x_n - f(x_n)/f'(x_n)$: apply the algorithm (4.79) with $t = x_n$ and use (4.80).
Once $x_n$ has converged to $\alpha$, the $b_k$ generated in (4.79) give us the coefficients of
the deflated polynomial (4.77) and we are ready to reapply Newton’s method to the
deflated polynomial.

It is clear that any real initial approximation $x_0$ generates a sequence of real
iterates $x_n$, so that Newton’s method started in this way can only compute real zeros of $f$ (if any).
For complex zeros one must start with a complex $x_0$, and the whole computation
proceeds in complex arithmetic. It is possible, however, to use division algorithms
with quadratic divisors to compute quadratic factors of $f$ entirely in real arithmetic
(Bairstow’s method). See Ex. 44 for details.

One word of caution is in order when one tries to compute all zeros of f 
successively by Newton’s method. It is true that Newton’s method combined with 
Horner’s scheme has a built-in mechanism of deflation, but this mechanism is valid 
only if one assumes convergence to the exact roots. This of course is impossible, 
partly because of rounding errors and partly because of the stopping criterion used 
to terminate the iteration prematurely. Thus, there is a build-up of errors in the 
successively deflated polynomials, which may well have a significant effect on 
the accuracy of the respective roots (cf. Chap. 1, Sect. 1.3.2(2)). It is therefore 
imperative, once all roots have been computed, to “purify” them by applying 
Newton’s method one more time to the original polynomial f in (4.74), using the 
computed roots as initial approximations. 
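
The procedure just described (Newton with Horner evaluation, deflation, and a final purification pass on the original polynomial) might be sketched as follows. The names `horner2`, `newton`, and `real_zeros`, the stopping rule, and the assumption that all zeros are real and that `x0` lies to the right of the largest one are illustrative choices, not prescriptions from the text.

```python
def horner2(a, t):
    """Return (f(t), f'(t), b) for the monic polynomial with non-leading
    coefficients a = [a_0, ..., a_{d-1}], using the scheme (4.79)."""
    d = len(a)
    b = [0.0] * d + [1.0]
    c = 1.0
    for k in range(d - 1, -1, -1):
        b[k] = t * b[k + 1] + a[k]
        if k >= 1:
            c = t * c + b[k]
    return b[0], c, b


def newton(a, x, tol=1e-12, maxit=100):
    """Ordinary Newton iteration; stops on a small relative step (assumed rule)."""
    for _ in range(maxit):
        fx, dfx, _ = horner2(a, x)
        step = fx / dfx
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x


def real_zeros(a, x0):
    """Compute all (assumed real) zeros by successive deflation, then purify."""
    work, raw = list(a), []
    while work:
        r = newton(work, x0)
        raw.append(r)
        _, _, b = horner2(work, r)
        work = b[1:len(work)]              # b_1, ..., b_{d-1}: deflated polynomial
    return [newton(a, r) for r in raw]     # purification on the original f
```

With `real_zeros([-6.0, 11.0, -6.0], 10.0)` the zeros of $x^3 - 6x^2 + 11x - 6$ are found in decreasing order; the last list comprehension is the purification step on the undeflated polynomial.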


4.8.2 An Accelerated Newton Method for Equations with Real Roots


If (4.74) has only real distinct roots,

$$f(\alpha_\nu) = 0, \qquad \alpha_d < \alpha_{d-1} < \cdots < \alpha_2 < \alpha_1, \qquad (4.81)$$

one can try to speed up Newton’s method by approaching each root from the right
with double the Newton steps until it is overshot, at which time one switches back
to the ordinary Newton iteration to finish off the root.

Underlying this method is the following interesting theorem. 


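Theorem 4.8.1. Let $f$ be a polynomial having only real zeros as in (4.81), and let
$\alpha_1'$ be the largest zero of $f'$. Then for every $z > \alpha_1$, defining

$$z' := z - \frac{f(z)}{f'(z)}, \qquad y := z - 2\,\frac{f(z)}{f'(z)}, \qquad y' := y - \frac{f(y)}{f'(y)}, \qquad (4.82)$$

one has

$$\alpha_1' < y, \qquad \alpha_1 \le y' \le z'. \qquad (4.83)$$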


The theorem suggests the following algorithm: start with some $x_0 > \alpha_1$, and
apply

$$x_{k+1} = x_k - 2\,\frac{f(x_k)}{f'(x_k)}, \qquad k = 0, 1, 2, \ldots. \qquad (4.84)$$

Then there are the following possibilities.

(i) We have $x_0 > x_1 > x_2 > \cdots > \alpha_1$ and $x_k \downarrow \alpha_1$ as $k \to \infty$. Since we use
double Newton steps in (4.84), convergence in this case is faster than for the
ordinary Newton iteration. Note also that $f(x_k) > 0$ for all $k$.

(ii) There exists a first index $k = k_0$ such that

$$f(x_0) f(x_k) > 0 \ \text{ for } \ 0 \le k < k_0, \qquad f(x_0) f(x_{k_0}) < 0.$$

Then $y := x_{k_0}$ is to the left of $\alpha_1$ (we overshot) but, by (4.83), to the right of $\alpha_1'$.
Using now $y$ as the starting value in the ordinary Newton iteration,

$$y_0 = y, \qquad y_{k+1} = y_k - \frac{f(y_k)}{f'(y_k)}, \qquad k = 0, 1, 2, \ldots,$$

brings us back to the right of $\alpha_1$ in the first step and then monotonically down
to $\alpha_1$.
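
The two cases can be sketched in code as follows: double Newton steps are taken until the sign of $f$ changes, after which ordinary Newton steps finish off the root. The names `horner2` and `accelerated_newton`, the tolerance, and the returned pair (root, overshoot point) are assumptions made for this illustration; the polynomial is taken monic with non-leading coefficients `[a_0, ..., a_{d-1}]`.

```python
def horner2(a, t):
    """Return (f(t), f'(t)) for the monic polynomial with non-leading
    coefficients a = [a_0, ..., a_{d-1}] (double Horner scheme)."""
    b, c = 1.0, 1.0
    for k in range(len(a) - 1, -1, -1):
        b = t * b + a[k]
        if k >= 1:
            c = t * c + b
    return b, c


def accelerated_newton(a, x0, tol=1e-12, maxit=100):
    """Double-step Newton from x0 (assumed to the right of the largest zero).

    Returns (root, y), where y is the overshoot point of case (ii),
    or None if case (i) occurred.
    """
    f0, _ = horner2(a, x0)
    x, y = x0, None
    for _ in range(maxit):                       # double Newton steps
        fx, dfx = horner2(a, x)
        if fx == 0.0:
            return x, y
        x_new = x - 2.0 * fx / dfx
        if horner2(a, x_new)[0] * f0 < 0.0:      # sign change: overshot
            y = x_new
            break
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new, y                      # case (i): no overshoot
        x = x_new
    else:
        return x, y                              # iteration budget exhausted
    x = y
    for _ in range(maxit):                       # case (ii): ordinary Newton
        fx, dfx = horner2(a, x)
        step = fx / dfx
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x, y
```

For $f(x) = (x-1)(x-2)(x-3)$ and $x_0 = 10$, the third double step overshoots the zero $\alpha_1 = 3$ and lands between $\alpha_1' = 2 + 1/\sqrt{3}$ and $\alpha_1$, in agreement with (4.83).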


In either case, having obtained $\alpha_1$, we apply the same procedure to the deflated
polynomial $f_1(x) = f(x)/(x - \alpha_1)$ to compute the next smaller zero. As starting value
we can take $\alpha_1$, or better, if case (ii) has occurred, $y = x_{k_0}$. The procedure can
obviously be continued until all roots have been computed.


4.9 Systems of Nonlinear Equations


Although most of the methods discussed earlier allow extensions to systems of
nonlinear equations,

$$f(x) = 0, \qquad f: \mathbb{R}^d \to \mathbb{R}^d, \qquad (4.85)$$

we consider here only two such methods: the fixed point iteration and Newton’s
method.


4.9.1 Contraction Mapping Principle 


We write (4.85) in fixed point form, $f(x) = x - \varphi(x)$, and consider the fixed point
iteration

$$x_{n+1} = \varphi(x_n), \qquad n = 0, 1, 2, \ldots. \qquad (4.86)$$

We say that $\varphi: \mathbb{R}^d \to \mathbb{R}^d$ is a contraction map (or is contractive) on a set $D \subset \mathbb{R}^d$
if there exists a constant $\gamma$ with $0 < \gamma < 1$ such that, in some appropriate vector
norm,

$$\|\varphi(x) - \varphi(x^*)\| \le \gamma \|x - x^*\| \quad \text{for all } x \in D,\ x^* \in D. \qquad (4.87)$$


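Theorem 4.9.1. (Contraction Mapping Principle). Let $D \subset \mathbb{R}^d$ be a complete
subset of $\mathbb{R}^d$ (i.e., either bounded and closed, or all of $\mathbb{R}^d$). If $\varphi: \mathbb{R}^d \to \mathbb{R}^d$ is
contractive in the sense of (4.87) and maps $D$ into $D$, then

(i) iteration (4.86) is well defined for any $x_0 \in D$ and converges to a unique fixed
point $\alpha \in D$,

$$\lim_{n \to \infty} x_n = \alpha; \qquad (4.88)$$

(ii) for $n = 1, 2, 3, \ldots$ there holds

$$\|x_n - \alpha\| \le \frac{\gamma^n}{1 - \gamma} \|x_1 - x_0\| \qquad (4.89)$$

and

$$\|x_n - \alpha\| \le \gamma^n \|x_0 - \alpha\|. \qquad (4.90)$$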


Proof. (i) Since $\varphi(D) \subset D$, iteration (4.86) is well defined. We have, for
$n = 1, 2, 3, \ldots$,

$$\|x_{n+1} - x_n\| = \|\varphi(x_n) - \varphi(x_{n-1})\| \le \gamma \|x_n - x_{n-1}\|.$$

Repeated application of this yields

$$\|x_{n+1} - x_n\| \le \gamma^n \|x_1 - x_0\|,$$

and hence, since

$$x_{n+p} - x_n = (x_{n+p} - x_{n+p-1}) + (x_{n+p-1} - x_{n+p-2}) + \cdots + (x_{n+1} - x_n),$$

more generally,

$$\|x_{n+p} - x_n\| \le \sum_{k=1}^{p} \|x_{n+k} - x_{n+k-1}\| \le \sum_{k=1}^{p} \gamma^{n+k-1} \|x_1 - x_0\| \le \gamma^n \sum_{k=1}^{\infty} \gamma^{k-1} \|x_1 - x_0\| = \frac{\gamma^n}{1 - \gamma} \|x_1 - x_0\|. \qquad (4.91)$$

Since $\gamma^n \to 0$, it follows that $\{x_n\}$ is a Cauchy sequence in $D$, and hence, since
$D$ is complete, converges to some $\alpha \in D$,

$$\lim_{n \to \infty} x_n = \alpha.$$

The limit $\alpha$ must be a fixed point of $\varphi$ since

$$\|x_n - \varphi(\alpha)\| = \|\varphi(x_{n-1}) - \varphi(\alpha)\| \le \gamma \|x_{n-1} - \alpha\|, \qquad (4.92)$$

hence $\alpha = \lim_{n \to \infty} x_n = \varphi(\alpha)$. Moreover, there can be only one fixed point in
$D$, since $\alpha = \varphi(\alpha)$, $\alpha^* = \varphi(\alpha^*)$, and $\alpha \in D$, $\alpha^* \in D$ imply $\|\alpha - \alpha^*\| =
\|\varphi(\alpha) - \varphi(\alpha^*)\| \le \gamma \|\alpha - \alpha^*\|$, that is, $(1 - \gamma)\|\alpha - \alpha^*\| \le 0$, and hence $\alpha = \alpha^*$,
since $1 - \gamma > 0$.

(ii) Letting $p \to \infty$ in (4.91) yields the first inequality in Theorem 4.9.1 (ii). The
second follows by a repeated application of (4.92), since $\varphi(\alpha) = \alpha$. □


Inequality (4.90) shows that the fixed point iteration converges (at least) linearly,
with an error bound having asymptotic error constant equal to $\gamma$.
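
As a small illustration of how the a priori bound (4.89) can serve as a stopping rule, consider the following sketch. The map `phi` (chosen here so that it is contractive on all of $\mathbb{R}^2$ with $\gamma = 1/2$ in the maximum norm), the function names, and the tolerance are assumptions for the example, not taken from the text.

```python
import math


def phi(x):
    """Hypothetical contraction on R^2: each component changes by at most
    half the change in its argument, so gamma = 0.5 in the max norm."""
    return (0.5 * math.cos(x[1]), 0.5 * math.sin(x[0]))


def dist(u, v):
    """Maximum-norm distance between two points."""
    return max(abs(p - q) for p, q in zip(u, v))


def fixed_point(phi, x0, gamma, tol, maxit=10_000):
    """Iterate (4.86) until the bound gamma^n/(1-gamma)*||x1-x0|| of (4.89)
    drops to tol or below. Returns the iterate x_n and the index n."""
    x = phi(x0)
    d1 = dist(x, x0)
    n = 1
    while gamma ** n / (1.0 - gamma) * d1 > tol and n < maxit:
        x = phi(x)
        n += 1
    return x, n


alpha, n = fixed_point(phi, (0.0, 0.0), gamma=0.5, tol=1e-10)
```

Because the bound depends only on $\gamma$ and $\|x_1 - x_0\|$, the number of steps is known after the first iteration; the actual error is typically much smaller than the bound when the local contraction rate is below $\gamma$.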


4.9.2 Newton’s Method for Systems of Equations 


As we mentioned earlier, Newton’s method can be easily adapted to deal with
systems of nonlinear equations, reducing the nonlinear problem to an infinite
sequence of linear problems, that is, systems of linear algebraic equations. The tool
is linearization at the current approximation.

Thus, given the equation (4.85), written more explicitly as

$$f^i(x^1, x^2, \ldots, x^d) = 0, \qquad i = 1, 2, \ldots, d \qquad (4.93)$$

(where components are indexed by superscripts), and given an approximation $x_0$ to
a solution $\alpha \in \mathbb{R}^d$, the $i$th equation in (4.93) is linearized at $x = x_0$ by truncating
the Taylor expansion of $f^i$ at $x_0$ after the linear terms. This gives

$$f^i(x_0) + \sum_{j=1}^{d} \frac{\partial f^i}{\partial x^j}(x_0)\,(x^j - x_0^j) = 0, \qquad i = 1, 2, \ldots, d,$$

or, written in vector form,

$$f(x_0) + \frac{\partial f}{\partial x}(x_0)\,(x - x_0) = 0, \qquad (4.94)$$

where

$$\frac{\partial f}{\partial x} := \left[ \frac{\partial f^i}{\partial x^j} \right]_{i,j=1}^{d} \qquad (4.95)$$


is the Jacobian matrix of $f$. This is the natural generalization of the first derivative
of a single function to systems of functions. The solution $x$ of (4.94) (a system
of linear algebraic equations) will be taken to be the next approximation. Thus, in
general, starting with an initial approximation $x_0$, Newton’s method will generate a
sequence of approximations $x_n \in \mathbb{R}^d$ by means of

$$\frac{\partial f}{\partial x}(x_n)\,\Delta_n = -f(x_n), \qquad x_{n+1} = x_n + \Delta_n, \qquad n = 0, 1, 2, \ldots, \qquad (4.96)$$

where we assume that the matrix $(\partial f/\partial x)(x_n)$ in the first equation is nonsingular
for each $n$. This will be the case if $(\partial f/\partial x)(\alpha)$ is nonsingular and $x_0$ is sufficiently
close to $\alpha$, in which case one can prove as in the one-dimensional case $d = 1$ that
Newton’s method converges quadratically to $\alpha$, that is, $\|x_{n+1} - \alpha\| = O(\|x_n - \alpha\|^2)$
as $n \to \infty$.

Writing (4.96) in the form

$$x_{n+1} = x_n - \left[ \frac{\partial f}{\partial x}(x_n) \right]^{-1} f(x_n), \qquad n = 0, 1, 2, \ldots, \qquad (4.97)$$

brings out the formal analogy with Newton’s method (4.55) for a single equation.
However, it is not necessary to compute the inverse of the Jacobian at each step; it
is more efficient to solve the linear system directly as in (4.96).

There are many ways to modify the initial stages of Newton’s method (for
systems of nonlinear equations) to force the iteration to make good progress toward
approaching a solution. These usually go under the name quasi-Newton methods
and they all share the idea of employing a suitable approximate inverse of the
Jacobian rather than the exact one. For these, and also for generalizations of the
secant method and the method of false position to systems of nonlinear equations,
we refer to specialized texts (cf. the Notes to Chap. 4).
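
The basic iteration (4.96), in which each step solves a linear system rather than inverting the Jacobian, can be sketched in plain Python as follows. The helper names `solve_linear` and `newton_system`, the stopping rule, and the $2 \times 2$ test system are assumptions chosen for illustration; in practice one would normally rely on a library linear solver.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]       # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[piv][col] == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            m = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= m * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def newton_system(f, jac, x0, tol=1e-12, maxit=50):
    """Newton's method (4.96): solve J(x_n) delta = -f(x_n), then update."""
    x = list(x0)
    for _ in range(maxit):
        delta = solve_linear(jac(x), [-v for v in f(x)])
        x = [xi + di for xi, di in zip(x, delta)]
        if max(abs(di) for di in delta) <= tol * max(1.0, max(abs(xi) for xi in x)):
            break
    return x


# Hypothetical example: x^2 + y^2 = 4, x y = 1.
f = lambda x: [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]
jac = lambda x: [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]
root = newton_system(f, jac, [2.0, 0.5])
```

Started at $(2, 0.5)$, the iteration converges to the solution with $x = \sqrt{2 + \sqrt{3}}$, $y = 1/x$; the Jacobian is nonsingular there, so the convergence is quadratic.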


4.10 Notes to Chapter 4 


Texts dealing largely with single nonlinear equations are Traub [1964], House- 
holder [1970], and Brent [2002]. Traub’s book provides a systematic treatment 
of iterative methods, both old and new. Although it deals also with questions of 
convergence, the emphasis is on the derivation, classification, and cataloguing of 
iterative methods. Householder’s book is strong on algebraic and analytic tools 
(often rooted in nineteenth-century mathematics) underlying numerical methods 
and less so on the methods themselves. Brent’s book, although mainly devoted 
to optimization, gives a detailed account, and a computer program, of an algo- 
rithm essentially due to Dekker, combining the secant method with bisection. 
Other well-tested algorithms available in software are, for example, the IMSL 
routine zreal, implementing Muller’s method (Muller [1956]), and zporc for 
(real) algebraic equations, implementing the Jenkins—Traub three-stage algorithm 
(Jenkins and Traub [1970]). The routine zplrc, based on Laguerre’s method 
(cf., e.g., Froberg [1985, Sect. 11.5]), also finds complex roots. The computation 
of zeros of analytic functions is treated in Kravanja and Van Barel [2000]. 

For the solution of systems of nonlinear equations, the books by Ortega and 
Rheinboldt [2000] and Rheinboldt [1998] give well-organized, and perhaps the 
most comprehensive, descriptions and mathematical analyses of nonlinear iterative 
methods. A less exhaustive, but mathematically penetrating, text, also dealing 
with operator equations in Banach space, is Ostrowski [1973]. At a lower level 
of mathematical sophistication, but richer in algorithmic details, is the book by 
Dennis and Schnabel [1996], which gives equal treatment to nonlinear equations 
and optimization. The books by Kelley [1995, 1999] dealing with iterative methods 
for linear and nonlinear systems of equations and for optimization, as well as 
Kelley [2003] specifically for Newton-type methods, provide e-mail and Web 
site addresses for respective Matlab codes. Complexity issues are discussed in 
Sikorski [2001] and iterative regularization methods for nonlinear inverse problems 
in Kaltenbacher et al. [2008]. A useful guide on available software for problems in 
optimization (including nonlinear equations) is Moré and Wright [1993]. For older 
precomputer literature, the two-volume treatise by Durand [1960, 1961] is still a 
valuable source.


Section 4.1. A sampling of typical systems of nonlinear equations occurring in
applied analysis is given in Ortega and Rheinboldt [2000, Chap. 1]. Many nonlinear 
equation problems arise from minimization problems, where one tries to find a 
minimum of a function of several variables. If it occurs at an interior point, then 
indeed the gradient of the function must vanish. 


Section 4.1.2. For more on shooting methods, see Chap. 7, Sect. 7.2. 


Section 4.1.3. Another, even simpler, way of approximating the solution $y$ of (4.9)
by $y_n$ is to apply an $n$-point quadrature rule to the integral on the left, thus writing
$y_n(x) = a(x) + \sum_{k=1}^{n} w_k K(x, t_k) f(t_k, y_n(t_k))$, $0 \le x \le 1$, and determining
$\{y_n(t_i)\}_{i=1}^{n}$ by putting $x = t_i$, $i = 1, 2, \ldots, n$, in this equation. Again, we are led
to a system of nonlinear equations.


Section 4.1.4. This example is taken from Gautschi and Milovanović [1997], where
one also finds a discussion on how to compute the Turán-type quadrature rule
associated with an $s$-orthogonal polynomial $\pi_n$. This is the quadrature rule of
maximum degree of exactness that involves a function and its derivatives up to order
$2s$ evaluated at the zeros of $\pi_n$.


Section 4.2. More refined measures for the speed of convergence are defined in
Brent [2002, p. 21] and, especially, in Ortega and Rheinboldt [2000, Chap. 9], where
the concepts of Q-order and R-order of convergence are introduced, the former
relating to quotients of errors, as in (4.23), the latter to $n$th roots $\varepsilon_n^{1/n}$. Also see
Potra [1989]. These are quantities that characterize the asymptotic behavior of the
error as $n \to \infty$; they say nothing about the initial behavior of the error. If one
wants to describe the overall behavior of the iteration, one needs a function, not a
number, to define the rate of convergence, for example, a function (if one exists)
that relates $\|x_{n+1} - x_n\|$ to $\|x_n - x_{n-1}\|$ (cf. Ex. 25(b)). This is the approach taken
in Potra and Pták [1984] to describe the convergence behavior of iterative processes
in complete metric spaces. For similar ideas in connection with the convergence
behavior of continued fractions, also see Gautschi [1983].

The efficiency index was introduced by Ostrowski [1973, Chap. 3, Sect. 11], who 
also coined the word “horner” for a unit of work. 

Aitken’s $\Delta^2$-process is a special case of a more general nonlinear acceleration
method, the Shanks transformation, or, in Wynn’s recursive implementation, the
epsilon algorithm; see Brezinski and Redivo-Zaglia [1991, pp. 78–95].


Section 4.3.2. For Sturm sequences and Sturm’s theorem see, for example,
Henrici [1988, p. 444ff], and for Gershgorin’s theorem, Golub and Van Loan
[1996, p. 320]. The bisection method based on Sturm sequences, in the context
of eigenvalues of a symmetric tridiagonal matrix, is implemented in the Eispack
routine BISECT (Smith et al. [1976, p. 211]).


Section 4.4. The method of false position is very old, originating in medieval Arabic
mathematics, and even earlier in fifth-century Indian texts (Plofker [1996, p. 254]).
Leonardo Pisano (better known as “Fibonacci”), in the thirteenth century, calls
it “regula duarum falsarum positionum,” which, in the sixteenth and seventeenth
centuries, became abbreviated to “regula positionum” or also “regula falsi.” Peter
Bienewitz (1527), obviously having a linear equation in mind, explains the method
in these (old German) words: “Vnd heisst nit darum falsi dass sie falsch vnd unrecht
wehr, sunder, dass sie auss zweyen falschen vnd vnwahrhaftigen zalen, vnd zweyen
lügen die wahrhaftige vnd begehrte zal finden lernt.” (Cf. Maas [1985].) In English,
roughly: it is called “falsi” not because it is false and wrong, but because it teaches
how to find the true and desired number from two false and untrue numbers, from two lies.

In the form (4.42), the regula falsi can be thought of as a discretized Newton’s
method, if in the latter, (4.55), one replaces the derivative $f'(x_n)$ by the difference
quotient $(f(x_n) - f(b))/(x_n - b)$. This suggests one (of many) possible extensions
to systems of equations in $\mathbb{R}^d$ (cf. Ortega and Rheinboldt [2000, p. 205]): define
vector difference quotients $\Delta_{n,i} = (x_n^i - b^i)^{-1} [f(x_n) - f(x_n + (b^i - x_n^i) e_i)]$,
$i = 1, 2, \ldots, d$, where $e_i$ is the $i$th coordinate vector, $x_n^T = [x_n^1, \ldots, x_n^d]$ and $b^T =
[b^1, \ldots, b^d]$ a fixed vector. (Note that the arguments in the two $f$-vectors are the
same except for the $i$th one, which is $x_n^i$ in the first, and $b^i$ in the second.) If we let
$\Delta_n = [\Delta_{n,1}, \ldots, \Delta_{n,d}]$, a “difference quotient matrix” of order $d$, the regula falsi
becomes $x_{n+1} = x_n - \Delta_n^{-1} f(x_n)$. As in the one-dimensional case, the method
converges no faster than linearly, if at all (Ortega and Rheinboldt [2000, p. 366]).


Section 4.5. The secant method is also rather old; it has been used, for example, in 
fifteenth-century Indian texts to compute the sine function (Plofker [1996]). 

The heuristic motivation of Theorem 4.5.1 uses some simple facts from the theory 
of linear difference equations with constant coefficients. A reader not familiar with 
these may wish to consult Henrici [1988, Sect. 7.4 and Notes to Sect. 7.4 on p. 663]. 

Like the method of false position, the secant method, too, can be extended to $\mathbb{R}^d$
in many different ways (see, e.g., Ortega and Rheinboldt [2000, Sect. 7.2], Dennis
and Schnabel [1996, Chap. 8]), one of which is to replace the vector $b$ (in the Notes
to Sect. 4.4) by the vector $x_{n-1}$. Theorem 4.5.2 then remains in force (Ortega and
Rheinboldt [2000, Sect. 11.2.9]).

Examples of combined methods mentioned at the end of this section are Dekker’s
method (Dekker [1969]) and its modification by Brent (Brent [2002]). The latter is
implemented in the Matlab function fzero.


Section 4.6. The history of Newton’s method is somewhat shrouded in obscurity. 
Newton’s original ideas on the subject, around 1669, were considerably more 
complicated and not even remotely similar to what is now conceived to be his 
method. Raphson in approximately 1690 gave a simplified version of Newton’s 
algorithm, possibly without knowledge of Newton’s work. In the English literature, 
the method is therefore often called the Newton—Raphson method. According to 
Kollerstrom [1992] and Ypma [1995], Newton’s and Raphson’s procedures are both 
purely algebraic without mention of derivatives (or fluxions, as it were). They credit 
Simpson with being the first to give a calculus description of Newton’s method 
in 1740, without referring either to Newton or to Raphson. As noted by Ypma 
[loc. cit.], Simpson applied Newton’s method even to a 2 x 2 system of nonlinear 
equations. The modern version of Newton’s iteration seems to appear first in a paper 
by Fourier published posthumously in 1831. See also Alexander [1996] for further 
historical comments.

Global convergence results for Newton’s method analogous to the one in
Example 4.6.4 exist also in higher dimension; see, for example, Ortega and 
Rheinboldt [2000, Sects. 13.3.4 and 13.3.7]. Local convergence, including its 
quadratic order, is now well established; see Ortega and Rheinboldt [2000, Sect. 
12.6 and NR 12.6-1] for precise statements and for a brief but informative history. 
Crucial in this development was the work of Kantorovich, who studied Newton’s 
method not only in finite-dimensional, but also in infinite-dimensional spaces. A 
good reference for the latter is Kantorovich and Akilov [1982, Chap. 18]. For a 
modification of Newton’s method which is cubically convergent but requires an 
extra evaluation of f, see Ortega and Rheinboldt [2000, p. 315]. 


Section 4.7. Ostrowski [1973, Chap. 4, Sect. 2] calls $\alpha$ a point of attraction for
iteration (4.69) if for any $x_0$ in a sufficiently close neighborhood of $\alpha$ one has
$x_n \to \alpha$, and a point of repulsion otherwise. Theorem 4.7.1 tells us that $\alpha$ is a
point of attraction if $|\varphi'(\alpha)| < 1$; it is clearly a point of repulsion if $|\varphi'(\alpha)| > 1$.
An analogous situation holds in $\mathbb{R}^d$ (Ostrowski [loc. cit., Chap. 22]): if the Jacobian
matrix $\partial\varphi/\partial x$ at $x = \alpha$ has spectral radius $< 1$, then $\alpha$ is a point of attraction and
hence the fixed point iteration is locally convergent. If the spectral radius is $> 1$, then
$\alpha$ is a point of repulsion.


Section 4.8. An unusually detailed treatment of algebraic equations and their 
numerical solution, especially by older methods, is given in the French work by 
Durand [1960]. More recent texts are McNamee [2007] and Kyurkchiev [1998], the 
latter focusing more specifically on iterative methods for computing all roots of an 
algebraic equation simultaneously and in particular studying “critical” regions in the 
complex plane that give rise to divergence if the initial approximations are contained 
therein. Also focusing on simultaneous computation of all the roots is the book by 
Petković [2008]. Algebraic equations are special enough that detailed information
can be had about the location of their roots, and a number of methods can be 
devised specifically tailored to them. Good accounts of localization theorems can 
be found in Householder [1970, Chap. 2] and Marden [1966]. Among the classical 
methods appropriate for algebraic equations, the best known is Graeffe’s method, 
which basically attempts to separate the moduli of the roots by successive squaring. 
Another is Cauchy’s method — a quadratic extension of Newton’s method — which 
requires second derivatives and is thus applied more readily to polynomial equations 
than to general nonlinear equations. Combined with the more recent method of 
Muller [1956] — a quadratic extension of the secant method — it can be made 
the basis of a reliable rootfinder (Young and Gregory [1988, Vol. 1, Sect. 5.4]). 
Among contemporary methods, mention should be made of the Lehmer—Schur 
method (Lehmer [1961], Henrici [1988, Sect. 6.10]), which constructs a sequence 
of shrinking circular disks in the complex plane eventually capturing a root, and of 
Rutishauser’s QD algorithm (Rutishauser [1957], Henrici [1988, Sect. 7.6]), which 
under appropriate separation assumptions allows all zeros of a polynomial to be 
computed simultaneously. The same global character is shared by the Durand–Kerner
method, which is basically Newton’s method applied to the system of
equations $a_i = \sigma_{d-i}(\alpha_1, \ldots, \alpha_d)$, $i = 0, 1, \ldots, d - 1$, expressing the coefficients
of the polynomial in terms of the elementary symmetric functions in the zeros. (The
same method, incidentally, was already used by Weierstrass [1891] to prove the
fundamental theorem of algebra.) For other global methods, see Werner [1982].
Iterative methods carried out in complex interval arithmetic (based on circular disks
or rectangles) are studied in Petković [1989]. From a set of initial complex intervals,
each containing a zero of the polynomial, they generate a sequence of complex 
intervals encapsulating the respective zeros ever more closely. The midpoints of 
these intervals, taken to approximate the zeros, then come equipped with ready- 
made error estimates. 


Section 4.8.1 (a). For Ostrowski’s theorem, see Ostrowski [1954], and for literature 
on the numerical properties of Horner’s scheme and more efficient schemes, 
Gautschi [1975a, Sect. 1.5.1(vi)]. 

The adverse accumulation of error in polynomial deflation can be mitigated 
somewhat by a more careful deflation algorithm, which depends on the relative 
magnitude of the zero being removed; see Peters and Wilkinson [1971] and 
Cohen [1994]. 


Section 4.8.2. The accelerated Newton method is due independently to Kahan and 
Maehly (cf. Wilkinson [1988, p. 480]). A proof of Theorem 4.8.1 can be found in 
Stoer and Bulirsch [2002, Sect. 5.5]. 


Section 4.9. Among the many other iterative methods for solving systems of 
nonlinear equations, we mention the nonlinear analogues of the well-known iterative 
methods for solving linear systems. Thus, for example, the nonlinear Gauss–Seidel
method consists of solving the $d$ single equations
$f^i(x_{n+1}^1, \ldots, x_{n+1}^{i-1}, t, x_n^{i+1}, \ldots, x_n^d) = 0$, $i = 1, 2, \ldots, d$, for $t$ and letting $x_{n+1}^i = t$. The solution of each
of these equations will in turn involve some one-dimensional iterative method, for 
example, Newton’s method, which would constitute “inner iterations” in the “outer 
iteration” defined by Gauss—Seidel. Evidently, any pair of iterative methods can be 
so combined. Indeed, the roles can also be reversed, for example, by combining 
Newton’s method for systems of nonlinear equations with the Gauss—Seidel method 
for solving the linear systems in Newton’s method. Newton’s method then becomes 
the outer iteration and Gauss-Seidel the inner iteration. Still other methods involve 
homotopies (or continuation), embedding the given system of nonlinear equations 
in a one-parameter family of equations and approaching the desired solution 
via a sequence of intermediate solutions corresponding to appropriately chosen 
values of the parameter. Each of these intermediate solutions serves as an initial 
approximation to the next solution. Although the basic idea of such methods is 
simple, many implementational details must be worked out to make it successful; 
for a discussion of these, see Allgower and Georg [2003]. The application of 
continuation methods to polynomial systems arising in engineering and scientific 
problems is considered in Morgan [1987]. Both texts contain computer programs. 
Also see Watson et al. [1987] for a software implementation. 
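
A minimal sketch of the nonlinear Gauss–Seidel idea, with scalar Newton steps as inner iterations, is given below. The interface (a function `f(i, x)` returning the $i$th residual and `dfdiag(i, x)` returning the partial derivative of the $i$th equation with respect to its own unknown), the tolerances, and the two-equation test system are assumptions made for illustration only.

```python
def nonlinear_gauss_seidel(f, dfdiag, x0, tol=1e-12, sweeps=100, inner=50):
    """Outer Gauss-Seidel sweeps; the i-th equation is solved for the i-th
    unknown by scalar Newton steps, always using the newest components."""
    x = list(x0)
    for _sweep in range(sweeps):
        change = 0.0
        for i in range(len(x)):
            old = x[i]
            for _it in range(inner):          # inner Newton iteration in t = x[i]
                step = f(i, x) / dfdiag(i, x)
                x[i] -= step
                if abs(step) <= tol:
                    break
            change = max(change, abs(x[i] - old))
        if change <= tol:                     # outer iteration has settled
            break
    return x


# Hypothetical test system:
#   4 x + x^3 - y - 1 = 0,   -x + 4 y + y^3 - 2 = 0.
def f(i, x):
    if i == 0:
        return 4.0 * x[0] + x[0] ** 3 - x[1] - 1.0
    return -x[0] + 4.0 * x[1] + x[1] ** 3 - 2.0


def dfdiag(i, x):
    return 4.0 + 3.0 * x[i] ** 2


sol = nonlinear_gauss_seidel(f, dfdiag, [0.0, 0.0])
```

In this example each equation depends strongly on its own unknown and only weakly on the other, which is the kind of situation in which one would expect the outer iteration to converge; no such guarantee holds for a general system.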


Section 4.9.1. If one is interested in individual components, rather than just
in norms, one can refine the contraction property by defining a map $\varphi$ to
be $\Gamma$-contractive on $D$ if there exists a nonnegative matrix $\Gamma \in \mathbb{R}^{d \times d}$, with
spectral radius $< 1$, such that $|\varphi(x) - \varphi(x^*)| \le \Gamma |x - x^*|$ (componentwise)
for all $x, x^* \in D$. The contraction mapping principle then extends naturally
to $\Gamma$-contractive maps (cf. Ortega and Rheinboldt [2000, Chap. 13]). Other
generalizations of the contraction mapping principle involve perturbations of the
map $\varphi$ (Ortega and Rheinboldt [2000, Chap. 12, Sect. 2]).


Section 4.9.2. For quasi-Newton methods (also called modification, or update, 
methods), see, for example, Ortega and Rheinboldt [2000, Chap. 7, Sect. 3], Dennis 
and Schnabel [1996, Chap. 6]. As with any iterative method, the increment vector
(e.g., the modified Newton increment) may be multiplied at each step by an
appropriate scalar to ensure that $\|f(x_{n+1})\| < \|f(x_n)\|$. This is particularly
advisable during the initial stages of the iteration.
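
One simple way to choose such a scalar is repeated halving until the residual decreases. The sketch below does this for a single equation ($d = 1$) to keep it short; the halving strategy, the function name `damped_newton`, and the test function $\arctan x$ are assumptions for illustration, not a method prescribed in the text.

```python
import math


def damped_newton(f, df, x0, tol=1e-12, maxit=100):
    """Newton's method in which the increment is scaled by lam = 1, 1/2, 1/4, ...
    until |f| decreases (a simple, assumed damping strategy)."""
    x = x0
    for _ in range(maxit):
        fx = f(x)
        if abs(fx) <= tol:
            break
        delta = -fx / df(x)                  # full Newton increment
        lam = 1.0
        while abs(f(x + lam * delta)) >= abs(fx) and lam > 1e-10:
            lam *= 0.5                       # shrink until the residual decreases
        x += lam * delta
    return x


root = damped_newton(math.atan, lambda x: 1.0 / (1.0 + x * x), 3.0)
```

For $f(x) = \arctan x$ and $x_0 = 3$ the undamped Newton iterates move away from the root $0$; with the halving rule the first step is shortened to a quarter of the Newton increment, after which full steps are accepted.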


Exercises and Machine Assignments to Chapter 4 


Exercises 


1. The following sequences all converge to zero as $n \to \infty$:

$$v_n = n^{-10}, \quad w_n = 10^{-n}, \quad x_n = 10^{-n^2}, \quad y_n = n^{10}\,3^{-n}, \quad z_n = 10^{-3 \cdot 2^n}.$$

Indicate the type of convergence by placing a check mark in the appropriate
position in the following table.

Type of Convergence    v    w    x    y    z
Sublinear
Linear
Superlinear
Quadratic
Cubic
None of the above


2. Suppose a positive sequence $\{\varepsilon_n\}$ converges to zero with order $p > 0$. Does it
then also converge to zero with order $p'$ for any $0 < p' < p$?

3. The sequence $\varepsilon_n = e^{-e^n}$, $n = 0, 1, \ldots$, clearly converges to 0 as $n \to \infty$.
What is the order of convergence?

4. Give an example of a positive sequence $\{\varepsilon_n\}$ converging to zero in such a way
that $\lim_{n \to \infty} \varepsilon_{n+1}/\varepsilon_n^p = 0$ for some $p > 1$, but not converging (to zero) with any
order $p' > p$.

5. Suppose $\{x_n\}$ converges linearly to $\alpha$ in the sense that $\lim_{n \to \infty} \frac{x_{n+1} - \alpha}{x_n - \alpha} = c$, $0 < |c| < 1$.

(a) Define $x_n^* = \frac{1}{2}(x_n + x_{n-1})$, $n = 1, 2, 3, \ldots$. Clearly, $x_n^* \to \alpha$. Does
$\{x_n^*\}$ converge appreciably faster than $\{x_n\}$? Explain by determining the
asymptotic error constant.

(b) Do the same for $x_n^* = \sqrt{x_n x_{n-1}}$, assuming $x_n > 0$ for all $n$ and $\alpha > 0$.

6. Let $\{x_n\}$ be a sequence converging to $\alpha$ linearly with asymptotic error constant $c$,

$$\lim_{n \to \infty} \frac{x_{n+1} - \alpha}{x_n - \alpha} = c, \qquad |c| < 1,$$

and assume that $x_n \ne \alpha$ for all $n$.

(a) Derive Aitken’s $\Delta^2$-process (4.21) by assuming two consecutive ratios in
the above limit relation (say, for $n$ and $n + 1$) to be equal to $c$.

(b) Show that the sequence $\{x_n'\}$ in Aitken’s $\Delta^2$-process is well defined for $n$
sufficiently large.

(c) Prove (4.22).

7. Given an iterative method of order $p$ and asymptotic error constant $c \ne 0$,
define a new iterative method consisting of $m$ consecutive steps of the given
method. Determine the order of this new iterative method and its asymptotic
error constant. Hence justify the definition of the efficiency index given near
the end of Sect. 4.2.

8. Consider the equation

$$\frac{1}{x - 1} + \frac{2}{x - 3} + \frac{4}{x - 5} - 1 = 0.$$

(a) Discuss graphically the number of real roots and their approximate location.

(b) Are there any complex roots?

9. Consider the equation $x \tan x = 1$.

(a) Discuss the real roots of this equation: their number, approximate location,
and symmetry properties. Use appropriate graphs.

(b) How many bisections would be required to find the smallest positive root
to within an error of $5 \times 10^{-8}$? (Indicate the initial approximations.) Is your
answer valid for all roots?

(c) Are there any complex roots? Explain.

10. Consider the quadratic equation $x^2 - p = 0$, $p > 0$. Suppose its positive root
$\alpha = \sqrt{p}$ is computed by the method of false position starting with two numbers
$a$, $b$ satisfying $0 < a < \alpha < b$. Determine the asymptotic error constant $c$ as a
function of $b$ and $\alpha$. What are the conditions on $b$ for $0 < c < \frac{1}{2}$ to hold, that is,
for the method of false position to be (asymptotically) faster than the bisection
method?


11. The equation $x^2 - a = 0$ (for the square root $\alpha = \sqrt{a}$) can be written
equivalently in the form

$$x = \varphi(x)$$

in many different ways, for example,

$$\varphi(x) = \frac{1}{2}\left(x + \frac{a}{x}\right), \qquad \varphi(x) = \frac{a}{x}, \qquad \varphi(x) = 2x - \frac{a}{x}.$$

Discuss the convergence (or nonconvergence) behavior of the iteration $x_{n+1} =
\varphi(x_n)$, $n = 0, 1, 2, \ldots$, for each of these three iteration functions. In case of
convergence, determine the order of convergence.


12. Under the assumptions of Theorem 4.5.1, show that the secant method cannot
converge faster than with order $p = \frac{1}{2}(1 + \sqrt{5})$ if the (simple) root $\alpha$ of
$f(x) = 0$ satisfies $f''(\alpha) \ne 0$.


13. Let $\{x_n\}$ be a sequence converging to $\alpha$. Suppose the errors $e_n = |x_n - \alpha|$
satisfy $e_{n+1} \le M e_n^2 e_{n-1}$ for some constant $M > 0$. What can be said about the
order of convergence?


14. Suppose the equation $f(x) = 0$ has a simple root $\alpha$, and $f''(\alpha) = 0$, $f'''(\alpha) \ne 0$.
Provide heuristics in the manner of the text preceding Theorem 4.5.1 showing
that the secant method in this case converges quadratically.


15. (a) Consider the iteration $x_{n+1} = x_n^3$. Give a detailed discussion of the
behavior of the sequence $\{x_n\}$ in dependence of $x_0$.
(b) Do the same as (a), but for $x_{n+1} = x_n^{1/3}$, $x_0 > 0$.


16. Consider the iteration

$$x_{n+1} = \varphi(x_n), \qquad \varphi(x) = \sqrt{2 + x}.$$

(a) Show that for any positive $x_0$ the iterates $x_n$ remain on the same side of
$\alpha = 2$ as $x_0$ and converge monotonically to $\alpha$.

(b) Show that the iteration converges globally, that is, for any $x_0 > 0$, and not
faster than linearly (unless $x_0 = 2$).

(c) If $0 < x_0 < 2$, how many iteration steps are required to obtain $\alpha$ with an
error less than $10^{-10}$?


17. Consider the equation $x = \cos x$.

(a) Show graphically that there exists a unique positive root $\alpha$. Indicate,
approximately, where it is located.
(b) Prove local convergence of the iteration $x_{n+1} = \cos x_n$.
(c) For the iteration in (b) prove: if $x_n \in [0, \frac{\pi}{2}]$, then

$$|x_{n+1} - \alpha| < \sin\left(\frac{\alpha + \pi/2}{2}\right) |x_n - \alpha|.$$

In particular, one has global convergence on $[0, \frac{\pi}{2}]$.
(d) Show that Newton’s method applied to $f(x) = 0$, $f(x) = x - \cos x$, also
converges globally on $[0, \frac{\pi}{2}]$.


18. Consider the equation

$$x = e^{-x}.$$

(a) Show that there is a unique real root $\alpha$ and determine an interval
containing it.
(b) Show that the fixed point iteration $x_{n+1} = e^{-x_n}$, $n = 0, 1, 2, \ldots$, converges
locally to $\alpha$ and determine the asymptotic error constant.
(c) Illustrate graphically that the iteration in (b) actually converges globally,
that is, for arbitrary $x_0 > 0$. Then prove it.
(d) An equivalent equation is

$$x = \ln \frac{1}{x}.$$

Does the iteration $x_{n+1} = \ln \frac{1}{x_n}$ also converge locally? Explain.


19. Consider the equation

$$\tan x + \lambda x = 0, \qquad 0 < \lambda < 1.$$

(a) Show graphically, as simply as possible, that in the interval $[\frac{1}{2}\pi, \pi]$ there is
exactly one root $\alpha$.

(b) Does Newton’s method converge to the root $\alpha \in [\frac{1}{2}\pi, \pi]$ if the initial
approximation is taken to be $x_0 = \pi$? Justify your answer.


20. Consider the equation

$$f(x) = 0, \qquad f(x) = \ln^2 x - x - 1, \qquad x > 0.$$

(a) Graphical considerations suggest that there is exactly one positive root $\alpha$,
and that $0 < \alpha < 1$. Prove this.

(b) What is the largest positive $b < 1$ such that Newton’s method, started with
$x_0 = b$, converges to $\alpha$?


21. Consider "Kepler's equation"

$$f(x) = 0, \quad f(x) = x - \varepsilon \sin x - \eta, \quad 0 < |\varepsilon| < 1, \quad \eta \in \mathbb{R},$$

where $\varepsilon$, $\eta$ are parameters constrained as indicated.

(a) Show that for each $\varepsilon$, $\eta$ there is exactly one real root $\alpha = \alpha(\varepsilon, \eta)$.
Furthermore, $\eta - |\varepsilon| \le \alpha(\varepsilon, \eta) \le \eta + |\varepsilon|$.

(b) Writing the equation in fixed point form

$$x = \varphi(x), \quad \varphi(x) = \varepsilon \sin x + \eta,$$

show that the fixed point iteration $x_{n+1} = \varphi(x_n)$ converges for arbitrary
starting value $x_0$.

(c) Let $m$ be an integer such that $m\pi < \eta < (m + 1)\pi$. Show that Newton's
method with starting value

$$x_0 = \begin{cases} (m + 1)\pi & \text{if } (-1)^m \varepsilon > 0, \\ m\pi & \text{otherwise} \end{cases}$$

is guaranteed to converge (monotonically) to $\alpha(\varepsilon, \eta)$.

(d) Estimate the asymptotic error constant $c$ of Newton's method.

22. (a) Devise an iterative scheme, using only addition and multiplication, for
computing the reciprocal $\frac{1}{a}$ of some positive number $a$. {Hint: use Newton's
method. For a cubically convergent scheme, see Ex. 40.}

(b) For what positive starting values $x_0$ does the algorithm in (a) converge?
What happens if $x_0 < 0$?

(c) Since in (binary) floating-point arithmetic it suffices to find the reciprocal
of the mantissa, assume $\frac12 \le a < 1$. Show, in this case, that the iterates $x_n$
satisfy

$$\left|x_{n+1} - \frac{1}{a}\right| < \left|x_n - \frac{1}{a}\right|^2, \quad \text{all } n \ge 0.$$

(d) Using the result of (c), estimate how many iterations are required, at most,
to obtain $1/a$ with an error less than $2^{-48}$ if one takes $x_0 = \frac32$.

23. (a) If $A > 0$, then $\alpha = \sqrt{A}$ is a root of either equation

$$x^2 - A = 0, \qquad \frac{A}{x^2} - 1 = 0.$$

Explain why Newton's method applied to the first equation converges for
arbitrary starting value $x_0 > 0$, whereas the same method applied to the
second equation produces positive iterates $x_n$ converging to $\alpha$ only if $x_0$ is
in some interval $0 < x_0 < b$. Determine $b$.

(b) Do the same as (a), but for the cube root $\sqrt[3]{A}$ and the equations

24. (a) Show that Newton's iteration

$$x_{n+1} = \frac12\left(x_n + \frac{a}{x_n}\right), \quad a > 0,$$

for computing the square root $\alpha = \sqrt{a}$ satisfies

$$\frac{x_{n+1} - \alpha}{(x_n - \alpha)^2} = \frac{1}{2x_n}.$$

Hence, directly obtain the asymptotic error constant.
(b) What is the analogous formula for the cube root?
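The identity in Ex. 24(a) can be checked numerically before it is proved. The following is a minimal Python sketch (Python rather than the book's Matlab; the function name `newton_sqrt_iterates`, the sample value a = 2, and the start x_0 = 3 are illustrative assumptions, not taken from the text). It prints both sides of the identity for the first few iterates.

```python
import math

def newton_sqrt_iterates(a, x0, steps):
    """Return [x_0, ..., x_steps] for x_{n+1} = (x_n + a/x_n)/2."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(0.5 * (x + a / x))
    return xs

a = 2.0                      # sample value, chosen only for illustration
alpha = math.sqrt(a)
xs = newton_sqrt_iterates(a, x0=3.0, steps=3)

# Left- and right-hand sides of the identity in 24(a), for n = 0, 1, 2.
lhs = [(xs[n + 1] - alpha) / (xs[n] - alpha) ** 2 for n in range(3)]
rhs = [1.0 / (2.0 * xs[n]) for n in range(3)]
for l, r in zip(lhs, rhs):
    print(f"{l:.10f}  {r:.10f}")
```

Only a few steps are taken on purpose: once $x_n - \alpha$ reaches rounding level, the quotient on the left is dominated by cancellation and is no longer meaningful.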


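25. Consider Newton's method

$$x_{n+1} = \frac12\left(x_n + \frac{a}{x_n}\right), \quad a > 0,$$

for computing the square root $\alpha = \sqrt{a}$. Let $d_n = x_{n+1} - x_n$.
(a) Show that

$$x_n = \frac{a}{d_n + \sqrt{d_n^2 + a}}.$$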


(b) Use (a) to show that 


Discuss the significance of this result with regard to the overall behavior of 
Newton’s iteration. 


26. (a) Derive the iteration that results by applying Newton's method to $f(x) :=
x^3 - a = 0$ to compute the cube root $\alpha = a^{1/3}$ of $a > 0$.

(b) Consider the equivalent equation $f_\lambda(x) = 0$, where $f_\lambda(x) = x^{3-\lambda} - a x^{-\lambda}$,
and determine $\lambda$ so that Newton's method converges cubically. Write down
the resulting iteration in its simplest form.

27. Consider the two (equivalent) equations

$$\text{(A)} \quad x \ln x - 1 = 0, \qquad \text{(B)} \quad \ln x - \frac{1}{x} = 0.$$

(a) Show that there is exactly one positive root and find a rough interval
containing it.

(b) For both (A) and (B), determine the largest interval on which Newton's
method converges. {Hint: investigate the convexity of the functions involved.}

(c) Which of the two Newton iterations converges asymptotically faster?

28. Prove Theorem 4.6.1.

29. Consider the equation

$$f(x) = 0, \quad \text{where } f(x) = \tan x - cx, \quad 0 < c < 1.$$

(a) Show that the smallest positive root $\alpha$ is in the interval $(\pi, \frac32\pi)$.

(b) Show that Newton's method started at $x_0 = \pi$ is guaranteed to converge to
$\alpha$ if $c$ is small enough. Exactly how small does $c$ have to be?

30. We saw in Sect. 4.1.1 that the equation

$$\cos x \cosh x - 1 = 0$$

has exactly two roots $\alpha_n < \beta_n$ in each interval $[-\frac{\pi}{2} + 2n\pi, \frac{\pi}{2} + 2n\pi]$, $n =
1, 2, 3, \ldots$. Show that Newton's method applied to this equation converges to
$\alpha_n$ when initialized by $x_0 = -\frac{\pi}{2} + 2n\pi$ and to $\beta_n$ when initialized by $x_0 =
\frac{\pi}{2} + 2n\pi$.

31. In the engineering of circular shafts the following equation is important for
determining critical angular velocities:

$$f(x) = 0, \quad f(x) = \tan x + \tanh x, \quad x > 0.$$

(a) Show that there are infinitely many positive roots, exactly one, $\alpha_n$, in each
interval $[(n - \frac12)\pi, n\pi]$, $n = 1, 2, 3, \ldots$.

(b) Determine $\lim_{n\to\infty}(n\pi - \alpha_n)$.

(c) Discuss the convergence of Newton's method when started at $x_0 = n\pi$.

32. The equation

$$f(x) := x \tan x - 1 = 0,$$

if written as $\tan x = 1/x$ and each side plotted separately, can be seen to have
infinitely many positive roots, one, $\alpha_n$, in each interval $[n\pi, (n + \frac12)\pi]$, $n =
0, 1, 2, \ldots$.

(a) Show that the smallest positive root $\alpha_0$ can be obtained by Newton's method
started at $x_0 = \frac{\pi}{4}$.

(b) Show that Newton's method started with $x_0 = (n + \frac14)\pi$ converges
monotonically decreasing to $\alpha_n$ if $n \ge 1$.

(c) Expanding $\alpha_n$ (formally) in inverse powers of $\pi n$,

$$\alpha_n = \pi n + c_0 + c_1(\pi n)^{-1} + c_2(\pi n)^{-2} + c_3(\pi n)^{-3} + \cdots,$$

determine $c_0, c_1, c_2, \ldots, c_9$. {Hint: use the Maple series command.}
(d) Use the Matlab function fzero to compute $\alpha_n$ for n = 1 : 10 and compare
the results with the approximation furnished by the expansion in (c).

33. Consider the equation

$$f(x) = 0, \quad f(x) = x \sin x - 1, \quad 0 \le x \le \pi.$$

(a) Show graphically (as simply as possible) that there are exactly two roots in
the interval $[0, \pi]$ and determine their approximate locations.

(b) What happens with Newton's method when it is started with $x_0 = \frac12\pi$?
Does it converge, and if so, to which root? Where do you need to start
Newton's method to get the other root?

34. (Gregory, 1672) For an integer $n \ge 1$, consider the equation

$$f(x) = 0, \quad f(x) = x^{n+1} - b^n x + a b^n, \quad a > 0, \quad b > 0.$$

(a) Prove that the equation has exactly two distinct positive roots if and only if

$$a < \frac{n b}{(n + 1)^{1 + \frac{1}{n}}}.$$

{Hint: analyze the convexity of $f$.}

(b) Assuming that the condition in (a) holds, show that Newton's method
converges to the smaller positive root, when started at $x_0 = a$, and to the
larger one, when started at $x_0 = b$.

35. Suppose the equation $f(x) = 0$ has the root $\alpha$ with exact multiplicity $m \ge 2$,
and Newton's method converges to this root. Show that convergence is linear,
and determine the asymptotic error constant.

36. (a) Let $\alpha$ be a double root of the equation $f(x) = 0$, where $f$ is sufficiently
smooth near $\alpha$. Show that the "doubly relaxed" Newton's method

$$x_{n+1} = x_n - 2\,\frac{f(x_n)}{f'(x_n)},$$

if it converges to $\alpha$, does so at least quadratically. Obtain the condition
under which the order of convergence is exactly 2, and determine the
asymptotic error constant $c$ in this case.

(b) What are the analogous statements in the case of an $m$-fold root?

37. Consider the equation $x \ln x = a$.

(a) Show that for each $a > 0$ the equation has a unique positive root, $x = x(a)$.
(b) Prove that

$$x(a) \sim \frac{a}{\ln a} \quad \text{as } a \to \infty$$

(i.e., $\lim_{a\to\infty} \frac{x(a) \ln a}{a} = 1$). {Hint: use the rule of Bernoulli-L'Hospital.}

(c) For large $a$ improve the approximation given in (b) by applying one step of
Newton's method.

38. The equation $x^2 - 2 = 0$ can be written as a fixed point problem in different
ways, for example,

$$\text{(a)} \quad x = \frac{2}{x}, \qquad \text{(b)} \quad x = x^2 + x - 2, \qquad \text{(c)} \quad x = \frac{x + 2}{x + 1}.$$

How does the fixed point iteration perform in each of these three cases? Be as
specific as you can.

39. Show that

$$x_{n+1} = \frac{x_n (x_n^2 + 3a)}{3x_n^2 + a}, \quad n = 0, 1, 2, \ldots,$$

is a method for computing $\alpha = \sqrt{a}$, $a > 0$, which converges cubically to $\alpha$ (for
suitable $x_0$). Determine the asymptotic error constant. (Cf. also Ex. 43(e) with
$\lambda = 2$.)

40. Consider the fixed point iteration

$$x_{n+1} = \varphi(x_n), \quad n = 0, 1, 2, \ldots,$$

where

$$\varphi(x) = Ax + Bx^2 + Cx^3.$$

(a) Given a positive number $a$, determine the constants $A$, $B$, $C$ such that the
iteration converges locally to $1/a$ with order $p = 3$. {This will give a
cubically convergent method for computing the reciprocal $1/a$ of $a$ which
uses only addition, subtraction, and multiplication.}

(b) Determine the precise condition on the initial error $\varepsilon_0 = x_0 - \frac{1}{a}$ for the
iteration to converge.

41. The equation $f(x) := x^2 - 3x + 2 = 0$ has the roots 1 and 2. Written in fixed
point form $x = \frac{1}{\omega}(x^2 - (3 - \omega)x + 2)$, $\omega \ne 0$, it suggests the iteration

$$x_{n+1} = \frac{1}{\omega}\left(x_n^2 - (3 - \omega)x_n + 2\right), \quad n = 0, 1, 2, \ldots \quad (\omega \ne 0).$$

(a) Identify as large an $\omega$-interval as possible such that for any $\omega$ in this interval
the iteration converges to 1 (when $x_0 \ne 1$ is suitably chosen).

(b) Do the same as (a), but for the root 2 (and $x_0 \ne 2$).

(c) For what value(s) of $\omega$ does the iteration converge quadratically to 1?

(d) Interpret the algorithm produced in (c) as a Newton iteration for some
equation $F(x) = 0$, and exhibit $F$. Hence discuss for what initial values $x_0$
the method converges.

42. Let $\alpha$ be a simple zero of $f$ and $f \in C^p$ near $\alpha$, where $p \ge 3$. Show: if
$f''(\alpha) = \cdots = f^{(p-1)}(\alpha) = 0$, $f^{(p)}(\alpha) \ne 0$, then Newton's method applied
to $f(x) = 0$ converges to $\alpha$ locally with order $p$. Determine the asymptotic
error constant.

43. The iteration

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n) - \frac12 f''(x_n)\,\frac{f(x_n)}{f'(x_n)}}, \quad n = 0, 1, 2, \ldots,$$

for solving the equation $f(x) = 0$ is known as Halley's method.

(a) Interpret Halley's method geometrically as the intersection with the x-axis
of a hyperbola with asymptotes parallel to the x- and y-axes that is
osculatory to the curve $y = f(x)$ at $x = x_n$ (i.e., is tangent to the curve at
this point and has the same curvature there).

(b) Show that the method can, alternatively, be interpreted as applying
Newton's method to the equation $g(x) = 0$, $g(x) := f(x)/\sqrt{f'(x)}$.

(c) Assuming $\alpha$ is a simple root of the equation, and $x_n \to \alpha$ as $n \to \infty$, show
that convergence is exactly cubic, unless the "Schwarzian derivative"

$$(Sf)(x) := \frac{f'''(x)}{f'(x)} - \frac32 \left(\frac{f''(x)}{f'(x)}\right)^2$$

vanishes at $x = \alpha$, in which case the order of convergence is larger than 3.
(d) Is Halley's method more efficient than Newton's method as measured in
terms of the efficiency index?
(e) How does Halley's method look in the case $f(x) = x^\lambda - a$, $a > 0$?
(Compare with Ex. 39.)


44. Let $f(x) = x^d + a_{d-1}x^{d-1} + \cdots + a_0$ be a polynomial of degree $d \ge 2$ with
real coefficients $a_i$.

(a) In analogy to (4.75), let

$$f(x) = (x^2 - tx - s)(x^{d-2} + b_{d-1}x^{d-3} + \cdots + b_2) + b_1(x - t) + b_0.$$

Derive a recursive algorithm for computing $b_{d-1}, b_{d-2}, \ldots, b_1, b_0$ in this
order.

(b) Suppose $\alpha$ is a complex zero of $f$. How can $f$ be deflated to remove the
pair of zeros $\alpha$, $\bar{\alpha}$?

(c) (Bairstow's method) Devise a method based on the division algorithm of (a)
to compute a quadratic factor of $f$. Use Newton's method for a system of
two equations in the two unknowns $t$ and $s$ and exhibit recurrence formulae
for computing the elements of the $2 \times 2$ Jacobian matrix of the system.


45. Let $p(t)$ be a monic polynomial of degree $n$. Let $x \in \mathbb{C}^n$ and define

$$f_\nu(x) = [x_1, x_2, \ldots, x_\nu]\,p, \quad \nu = 1, 2, \ldots, n,$$

to be the divided differences of $p$ relative to the coordinates $x_\mu$ of $x$. Consider
the system of equations

$$f(x) = 0, \quad [f(x)]^T = [f_1(x), f_2(x), \ldots, f_n(x)].$$

(a) Let $\alpha^T = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ be the zeros of $p$, assumed mutually distinct.
Show that $\alpha$ is, except for a permutation of the components, the unique
solution of $f(x) = 0$. {Hint: use Newton's formula of interpolation.}

(b) Describe the application of Newton's iterative method to the preceding
system of nonlinear equations, $f(x) = 0$. {Hint: use Chap. 2, Ex. 58.}

(c) Discuss to what extent the procedure in (a) and (b) is valid for nonpolynomial
functions $p$.


46. For the equation $f(x) = 0$ define

$$y^{[0]}(x) = x, \qquad y^{[1]}(x) = \frac{1}{f'(x)}, \qquad y^{[m]}(x) = \frac{1}{f'(x)}\,\frac{d}{dx}\, y^{[m-1]}(x), \quad m = 2, 3, \ldots.$$

Consider the iteration function

$$\varphi_r(x) := \sum_{m=0}^{r} (-1)^m\, \frac{y^{[m]}(x)}{m!}\, [f(x)]^m.$$

(When $r = 1$ this is the iteration function for Newton's method.) Show that
$\varphi_r(x)$ defines an iteration $x_{n+1} = \varphi_r(x_n)$, $n = 0, 1, 2, \ldots$, converging locally
with exact order $p = r + 1$ to a root $\alpha$ of the equation if $y^{[r+1]}(\alpha) f'(\alpha) \ne 0$.


Machine Assignments 


1. (a) Write a Matlab program that computes (in Matlab double precision) the
expanded form $p(x) = x^5 - 5x^4 + 10x^3 - 10x^2 + 5x - 1$ of the polynomial
$(x - 1)^5$. Run the program to print $p(x)/\mathrm{prec}$ for 200 equally spaced x-values
in a small neighborhood of $x = 1$ (say, $0.9986 \le x \le 1.0014$), where prec =
eps is the Matlab (double-precision) machine precision. Prepare a piecewise
linear plot of the results. Explain what you observe. What is the "uncertainty
interval" for the numerical root corresponding to the mathematical root
$x = 1$?
(b) Do the same as (a), but for the polynomial $p(x) = x^5 - 100x^4 + 3995x^3 -
79700x^2 + 794004x - 3160080$, the expanded form of $(x - 18)(x - 19)(x - 20)
(x - 21)(x - 22)$. Do the computation in Matlab single precision and take
for prec the respective machine precision. Examine a small interval around
$x = 22$ (say, $21.675 \le x \le 22.2$).
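The experiment in (a) can be tried out quickly outside Matlab. The following is a minimal Python sketch, not the requested Matlab program; the names `p_expanded`, `scaled`, and `sign_changes` are illustrative assumptions. It evaluates the expanded form of $(x-1)^5$ at 200 equally spaced points of the suggested interval, scales by the double-precision machine epsilon, and counts sign changes as a crude measure of how noisy the computed values are near $x = 1$.

```python
import sys

def p_expanded(x):
    """Expanded form of (x - 1)**5, evaluated term by term as written."""
    return x**5 - 5*x**4 + 10*x**3 - 10*x**2 + 5*x - 1

eps = sys.float_info.epsilon      # double-precision machine precision
n = 200
lo, hi = 0.9986, 1.0014           # interval suggested in the assignment
xs = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
scaled = [p_expanded(x) / eps for x in xs]

# A smooth quintic would change sign once; rounding noise produces many more.
sign_changes = sum(1 for u, v in zip(scaled, scaled[1:]) if u * v < 0)
print(max(abs(s) for s in scaled), sign_changes)
```

Plotting `scaled` against `xs` with any plotting tool gives the piecewise linear picture the assignment asks for.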
2. Consider the equation

$$\tfrac12 x - \sin x = 0.$$

(a) Show that the only positive root is located in the interval $[\frac12\pi, \pi]$.
(b) Compute the root to 7, 15, and 33 decimal places

(b1) by the method of bisection, using the Matlab function sbisec of
Sect. 4.3.1 with starting values $a = \frac12\pi$, $b = \pi$;

(b2) by the method of false position, using the Matlab function sfalsepos
of Sect. 4.4 with the same starting values as in (b1);

(b3) by the secant method, using the Matlab function ssecant of Sect. 4.5
with the same starting values as in (b1);

(b4) by Newton's method, using the Matlab function snewton of Sect. 4.6
with an appropriate starting value $a$.

In all cases print the number of iterations required.


3. For an integer $n \ge 2$, consider the equation

$$\frac{x + x^{-1}}{x^n + x^{-n}} = \frac{1}{n}.$$

(a) Write the equation equivalently as a polynomial equation, $p_n(x) = 0$.

(b) Use Descartes' rule of sign* (applied to $p_n(x) = 0$) to show that there are
exactly two positive roots, one in $(0, 1)$, the other in $(1, \infty)$. How are they
related? Denote the larger of the two roots by $\alpha_n$ ($> 1$). It is known (you do
not have to prove this) that

$$1 < \alpha_{n+1} < \alpha_n < 3, \quad n = 2, 3, 4, \ldots.$$

(c) Write and run a program applying the bisection method to compute $\alpha_n$, $n =
2, 3, \ldots, 20$, to six correct decimal places after the decimal point, using $[1, 3]$
as initial interval for $\alpha_2$, and $[1, \alpha_n]$ as initial interval for $\alpha_{n+1}$ ($n \ge 2$). For
each $n$, count the number of iterations required. Similarly, apply Newton's
method (to the equation $p_n(x) = 0$) to compute $\alpha_n$ to the same accuracy,
using the initial value 3 for $\alpha_2$ and the initial value $\alpha_n$ for $\alpha_{n+1}$ ($n \ge 2$).
(Justify these choices.) Again, for each $n$, count the number of iterations
required. In both cases, print $n$, $\alpha_n$, and the number of iterations. Use a do
while loop to program either method.

* Descartes' rule of sign says that if a real polynomial has $s$ sign changes in the sequence of its
nonzero coefficients, then it has $s$ positive zeros or a (nonnegative) even number less.
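Descartes' rule of sign, as quoted in the footnote to this assignment, only needs a count of sign changes in the nonzero coefficients. The following is a minimal Python sketch of that count (the function name `sign_changes` and the sample polynomials are illustrative assumptions, not from the text). It does not derive $p_n$, which is part (a) of the assignment.

```python
def sign_changes(coeffs):
    """Count sign changes in the sequence of nonzero coefficients.

    coeffs lists the polynomial coefficients in order of descending
    (or ascending) powers; zero coefficients are skipped, as in the rule.
    """
    signs = [1 if c > 0 else -1 for c in coeffs if c != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

# (x - 1)(x - 2) = x^2 - 3x + 2: two sign changes, and two positive zeros.
print(sign_changes([1, -3, 2]))
# x^3 + x - 1: one sign change, hence exactly one positive zero.
print(sign_changes([1, 0, 1, -1]))
```

When the count is 2, the rule alone leaves open whether there are two positive zeros or none, so an extra argument (for example a sign check at a suitable point) is still needed.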


4. Consider the equation

$$x = e^{-x}.$$

(a) Implement the fixed-point iteration $x_{n+1} = e^{-x_n}$ on the computer, starting
with $x_0 = 1$ and stopping at the first $n$ for which $x_{n+1}$ agrees with $x_n$ to
within the machine precision. Print this value of $n$ and the corresponding
$x_{n+1}$.

(b) If the equation is multiplied by $\omega$ ($\ne 0$ and $\ne -1$) and $x$ is added on both
sides, one gets the equivalent equation

$$x = \frac{\omega e^{-x} + x}{1 + \omega}.$$

Under what condition on $\omega$ does the fixed-point iteration for this equation
converge faster (ultimately) than the iteration in (a)? {This condition involves
the root $\alpha$ of the equation.}

(c) What is the optimal choice of $\omega$? Verify it by a machine computation in a
manner analogous to (a).
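The computations in (a) and (b) share one loop, so a single helper suffices. The following is a minimal Python sketch (Python rather than Matlab; the names `fixed_point` and `relaxed`, the absolute tolerance of four machine epsilons, and the sample value $\omega = 1$ are assumptions made for illustration, and $\omega = 1$ is not claimed to be the optimal choice asked for in (c)).

```python
import math

def fixed_point(phi, x0, tol=4 * 2.0 ** -52, max_iter=1000):
    """Iterate x_{n+1} = phi(x_n) until |x_{n+1} - x_n| <= tol.

    Returns (n, x_{n+1}) for the first n at which the test is met.
    The tolerance is an assumed stand-in for "within machine precision".
    """
    x = x0
    for n in range(max_iter):
        x_next = phi(x)
        if abs(x_next - x) <= tol:
            return n, x_next
        x = x_next
    raise RuntimeError("iteration did not settle within max_iter steps")

def relaxed(omega):
    """Iteration function of (b): x -> (omega*exp(-x) + x) / (1 + omega)."""
    return lambda x: (omega * math.exp(-x) + x) / (1.0 + omega)

n_plain, x_plain = fixed_point(lambda x: math.exp(-x), 1.0)
n_relaxed, x_relaxed = fixed_point(relaxed(1.0), 1.0)   # sample omega = 1
print(n_plain, x_plain)
print(n_relaxed, x_relaxed)
```

Rerunning the last call with other values of $\omega$ shows how strongly the iteration count depends on this parameter.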


5. Consider the boundary value problem

$$y'' + \sin y = 0, \quad 0 \le x \le \tfrac14\pi, \qquad y(0) = 0, \quad y(\tfrac14\pi) = 1,$$

which describes the angular motion of a pendulum.
(a) Use the Matlab integrator ode45.m to compute and plot the solution $u(x; s)$
of the associated initial value problem

$$u'' + \sin u = 0, \quad 0 \le x \le \tfrac14\pi,$$

$$u(0) = 0, \quad u'(0) = s$$

for s = .2(.2)2.

(b) Write and run a Matlab program that applies the method of bisection,
with tolerance 0.5e-12, to the equation $f(s) = 0$, $f(s) = u(\frac14\pi; s) - 1$.
Use the plots of (a) to choose starting values $s_0$, $s_1$ such that $f(s_0) < 0$,
$f(s_1) > 0$. Print the number of bisections and the value of $s$ so obtained.
{Suggestion: use a nonsymbolic version bisec.m of the program
sbisec.m of Sect. 4.3.1.}

(c) Plot the solution curve $y(x) = u(x; s)$ of the boundary value problem, with
$s$ as obtained in (b).

6. The boundary value problem

$$y'' = g(x, y, y'), \quad 0 \le x \le 1,$$

$$y(0) = y_0, \quad y(1) = y_1$$

may be discretized by replacing the first and second derivatives by centered
difference quotients relative to a grid of equally spaced points $x_k = \frac{k}{n+1}$,
$k = 0, 1, \ldots, n, n + 1$.

(a) Interpret the resulting equations as a fixed point problem in $\mathbb{R}^n$ and formulate
the respective fixed point iteration.

(b) Write a Matlab program that applies the fixed point iteration of (a) to the
problem

$$y'' - y = 0, \quad y(0) = 0, \quad y(1) = 1$$

(cf. the first Example of Chap. 7, Sect. 7.1.1). Run the program for n =
10, 100, 1000, 10000, stopping the iteration the first time two successive iterates
differ by less than .5e-14 in the $\infty$-norm. Print the number of iterations
required and the maximum error of the final iterate as an approximation to the
exact solution vector. Assuming this error is $O(h^p)$, determine numerically
the values of $p$ and of the constant implied in the order term. {Suggestion: for
solving tridiagonal systems of equations, use the Matlab routine tridiag.m
of Chap. 2, MA 8(a).}

(c) Apply the fixed point iteration of (a) to the boundary value problem of MA
5. Show that the iteration function is contractive. {Hint: use the fact that the
symmetric $n \times n$ tridiagonal matrix $A$ with elements $-2$ on the diagonal and
1 on the two side diagonals has an inverse satisfying $\|A^{-1}\|_\infty \le (n + 1)^2/8$.}


7. (a) Solve the finite difference equations obtained in MA 6(c) by Newton's
method, using n = 10, 100, 1000, and an error tolerance of 0.5e-14. Print
the number of iterations in each case and plot the respective solution curves.
{Suggestion: same as in MA 6(b).}

(b) Do the same as in (a) for the boundary value problem

$$y'' = y y', \quad y(0) = 0, \quad y(1) = 1,$$

but with n = 10, 50, 100, and error tolerance 0.5e-6. How would you check
your program for correctness?

(c) Show that the fixed point iteration applied to the finite difference equations
for the boundary value problem of (b) does not converge. {Hint: use $n^2/8 <
\|A^{-1}\|_\infty \le (n + 1)^2/8$ for the $n \times n$ tridiagonal matrix $A$ of MA 6(c).}

8. (a) The Littlewood-Salem-Izumi constant $\alpha_0$, defined as the unique solution in
$0 < \alpha < 1$ of

$$\int_0^{3\pi/2} \frac{\cos t}{t^\alpha}\, dt = 0,$$

is of interest in the theory of positive trigonometric sums (cf., e.g.,
Koumandos [2011, Theorem 2]). Use Newton's method in conjunction
with Gaussian quadrature to compute $\alpha_0$. {Hint: you need the Matlab
routine gauss.m along with the routines r_jacobi01.m, r_jacobi.m,
r_jaclog.m, and mm_log.m to do the integrations required. All these
routines are available on the Web at
http://www.cs.purdue.edu/archives/2002/wxg/codes/SOPQ.html.}

(b) Do the same as in (a) for the constant $\alpha_1$, the unique solution in $0 < \alpha < 1$ of

$$\int_0^{5\pi/4} \frac{\cos(t + \pi/4)}{t^\alpha}\, dt = 0$$

(cf. ibid., Theorem 4).

9. (a) Discuss how to simplify the system of nonlinear equations (4.17) for
the recurrence coefficients of the polynomials $\{\pi_{k,n}\}$, generating the
s-orthogonal polynomials $\pi_n = \pi_{n,n}$, when the measure $d\lambda(t) = w(t)\,dt$
is symmetric, i.e., the support of $d\lambda$ is an interval $[-a, a]$, $0 < a \le \infty$, and
$w(-t) = w(t)$ for all $t$ with $0 < t \le a$. {Hint: first show that the respective
monic s-orthogonal polynomial $\pi_n$ satisfies $\pi_n(-t) = (-1)^n \pi_n(t)$, $t \in
[-a, a]$ and similarly $\pi_{k,n}(-t) = (-1)^k \pi_{k,n}(t)$.}

(b) For $n = 2$, $s = 1$, and $s = 2$, explain how the recurrence coefficients $\beta_0$,
$\beta_1$ can be obtained analytically in terms of the moments $\mu_k = \int_{\mathbb{R}} t^k\, d\lambda(t)$,
$k = 0, 1, 2, \ldots$, of the measure. Provide numerical answers in the case of
the Legendre measure $d\lambda(t) = dt$, $t \in [-1, 1]$.

(c) Write a Matlab program for solving the system of nonlinear equations in
(a), using the program fsolve of the Matlab Optimization Toolbox. Run
the program for the Legendre measure and for n = 2 : 10 and $s = 1$ and
$s = 2$ for each $n$. Choose initial approximations as deemed useful and apply
appropriate $(s + 1)n$-point Gaussian quadrature rules to do the necessary
integrations. Print $\beta_0, \beta_1, \ldots, \beta_{n-1}$ and the zeros of $\pi_n$.

6. (a) From

$$x_{n+1} - \alpha = c(x_n - \alpha),$$

$$x_{n+2} - \alpha = c(x_{n+1} - \alpha),$$

solving the first equation for $c$ and substituting the result into the second
equation gives

$$(x_n - \alpha)(x_{n+2} - \alpha) = (x_{n+1} - \alpha)^2.$$

Solving for $\alpha$, one finds

$$\alpha = \frac{x_n x_{n+2} - x_{n+1}^2}{x_{n+2} - 2x_{n+1} + x_n} = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}.$$

Replacing $\alpha$ by $x_n'$ yields Aitken's $\Delta^2$-process.
(b) We need to show that $\Delta^2 x_n \ne 0$ for $n$ sufficiently large. Let $\varepsilon_n = x_n - \alpha$.
We have

$$\frac{\Delta^2 x_n}{\varepsilon_n} = \frac{\Delta^2 \varepsilon_n}{\varepsilon_n} = \frac{\varepsilon_{n+2}}{\varepsilon_{n+1}}\,\frac{\varepsilon_{n+1}}{\varepsilon_n} - 2\,\frac{\varepsilon_{n+1}}{\varepsilon_n} + 1 \to (c - 1)^2 \quad \text{as } n \to \infty,$$

from which the assertion follows.
(c) Let $\varepsilon_n' = x_n' - \alpha$. Subtracting $\alpha$ from both sides of

$$x_n' = x_n - \frac{(\Delta x_n)^2}{\Delta^2 x_n}$$

and dividing by $\varepsilon_n$, we get

$$\frac{\varepsilon_n'}{\varepsilon_n} = 1 - \frac{(\Delta x_n)^2}{\varepsilon_n\, \Delta^2 x_n} = 1 - \left(\frac{\Delta \varepsilon_n}{\varepsilon_n}\right)^2 \frac{1}{\Delta^2 \varepsilon_n / \varepsilon_n}.$$

Thus, by assumption and the result of (b),

$$\lim_{n\to\infty} \frac{\varepsilon_n'}{\varepsilon_n} = 1 - (c - 1)^2\, \frac{1}{(c - 1)^2} = 0,$$

as claimed.
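The acceleration effect shown in (c) is easy to observe numerically. The following is a minimal Python sketch (the function name `aitken`, the test iteration $x_{n+1} = e^{-x_n}$ with $x_0 = 1$, and the reference value of its fixed point are illustrative assumptions, not part of the solution). It applies the $\Delta^2$-process to a linearly convergent sequence and prints the errors before and after acceleration.

```python
import math

def aitken(xs):
    """Aitken's Delta^2 process: x'_n = x_n - (Delta x_n)^2 / Delta^2 x_n.

    Returns the accelerated values for n = 0, ..., len(xs) - 3.  Terms whose
    second difference is exactly zero are skipped to avoid dividing by zero.
    """
    out = []
    for n in range(len(xs) - 2):
        d1 = xs[n + 1] - xs[n]
        d2 = xs[n + 2] - 2.0 * xs[n + 1] + xs[n]
        if d2 != 0.0:
            out.append(xs[n] - d1 * d1 / d2)
    return out

# Linearly convergent test sequence x_{n+1} = exp(-x_n), x_0 = 1.
xs = [1.0]
for _ in range(12):
    xs.append(math.exp(-xs[-1]))
acc = aitken(xs)

alpha = 0.5671432904097838        # fixed point of exp(-x), for reference
for n in (2, 6, 10):
    print(n, abs(xs[n] - alpha), abs(acc[n] - alpha))
```

The ratio of the two printed errors shrinks as $n$ grows, in line with $\varepsilon_n'/\varepsilon_n \to 0$.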
9. (a) Clearly, with $\alpha$, also $-\alpha$ is a root of the equation. It suffices, therefore,
to look at positive roots. Writing the equation in the form $\tan x = 1/x$
and plotting both sides as functions of $x$ for $x > 0$ (see figure on the next
page), one sees that there is exactly one root $\alpha_k$ in each of the intervals
$I_k = [k\pi, (k + \frac12)\pi]$, $k = 0, 1, 2, \ldots$. Moreover, as $k \to \infty$, the root $\alpha_k$
approaches $k\pi$ from the right. (Cf. also Ex. 32.)

(b) Let $f(x) = x \tan x - 1$. As initial interval, we may take (for example)
the interval $[0, \frac38\pi]$, since $f(0) = -1$ and $f(\frac38\pi) = \frac38\pi \tan\frac38\pi - 1 =
1.8441\ldots > 0$. We then want $n$, the number of bisections, such that (cf.
(4.33))
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that is, 


8 + log 3% 
n> es | =o 
log 2 


This holds for all roots, since $f(k\pi) = -1 < 0$ and $f(k\pi + \frac38\pi) =
(k\pi + \frac38\pi)\tan\frac38\pi - 1 > 0$, $k = 1, 2, 3, \ldots$.


(c) We first apply the addition theorem for the tangent function to write

$$\tan(x + iy) = \frac{\tan x + \tan(iy)}{1 - \tan x \tan(iy)} = \frac{\tan x + i \tanh y}{1 - i \tan x \tanh y}$$

or, using the definition of the trigonometric and hyperbolic tangents,

$$\tan(x + iy) = \frac{\sin x \cosh y + i \cos x \sinh y}{\cos x \cosh y - i \sin x \sinh y}.$$

Multiplying numerator and denominator by $\cos x \cosh y + i \sin x \sinh y$
gives

$$\tan(x + iy) = \frac{\sin 2x + i \sinh 2y}{2(\cos^2 x \cosh^2 y + \sin^2 x \sinh^2 y)}.$$

Use the identities $2\cos^2 x = \cos 2x + 1$, $2\sin^2 x = 1 - \cos 2x$ in the
denominator, along with $\cosh^2 y - \sinh^2 y = 1$, $\cosh^2 y + \sinh^2 y =
\cosh 2y$, to obtain the pretty formula

$$\tan(x + iy) = \frac{\sin 2x + i \sinh 2y}{\cos 2x + \cosh 2y}.$$

Now suppose that

$$(x + iy)\tan(x + iy) - 1 = 0.$$

Then both the real and imaginary part must vanish, which gives the two
equations

$$x \sin 2x - y \sinh 2y - \cos 2x - \cosh 2y = 0,$$
$$x \sinh 2y + y \sin 2x = 0.$$

The second can be written as

$$xy \left(\frac{\sin 2x}{x} + \frac{\sinh 2y}{y}\right) = 0.$$

Since the function in parentheses is always positive, this implies either $x =
0$ or $y = 0$. The latter yields real roots, and the former is impossible, since
by the first equation above it would imply $y \sinh 2y + 1 + \cosh 2y = 0$,
which is clearly impossible, the left-hand side being $\ge 2$. Thus, there are no
complex roots.

32. (a) We have


$$f'(x) = \tan x + x(1 + \tan^2 x),$$
$$f''(x) = 2(1 + \tan^2 x) + 2x \tan x\,(1 + \tan^2 x) = 2(1 + \tan^2 x)(1 + x \tan x).$$

On the interval $[0, \frac{\pi}{2}]$, the function $f$ increases from $-1$ to $+\infty$ and is
convex. Furthermore, $f(\frac{\pi}{4}) = \frac{\pi}{4} - 1 < 0$. We thus need to show that
one step of Newton's method with $x_0 = \frac{\pi}{4}$ produces $x_1$ such that $x_1 < \frac{\pi}{2}$.
Since, by convexity, $x_1 > \alpha_0$, from then on, Newton's method will converge
monotonically decreasing to $\alpha_0$. Now,

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = \frac{\pi}{4} - \frac{\frac{\pi}{4} - 1}{1 + \frac{\pi}{2}} = \frac{\pi}{4} + \frac{1 - \frac{\pi}{4}}{1 + \frac{\pi}{2}},$$

which is indeed less than $\frac{\pi}{2}$, since

$$\frac{1 - \frac{\pi}{4}}{1 + \frac{\pi}{2}} = 0.0834\ldots < \frac{\pi}{4}.$$

(b) Again, $f$ increases from $-1$ to $+\infty$ on $[n\pi, (n + \frac12)\pi]$ and is convex.
Since

$$f\left((n + \tfrac14)\pi\right) = (n + \tfrac14)\pi - 1 > 0 \quad \text{for } n \ge 1,$$

Newton's method started at $(n + \frac14)\pi$ converges monotonically decreasing
to $\alpha_n$.

(c) We have

$$\alpha_n \tan \alpha_n - 1 = 0,$$

and therefore, inserting the expansion for $\alpha_n$ and noting that $\tan(\cdot + \pi n) =
\tan(\cdot)$,

$$\left(\pi n + c_0 + c_1(\pi n)^{-1} + c_2(\pi n)^{-2} + \cdots\right) \tan\left(c_0 + c_1(\pi n)^{-1} + c_2(\pi n)^{-2} + \cdots\right) - 1 = 0.$$

Evidently, $c_0 = 0$. Letting $x = (\pi n)^{-1}$ and multiplying through by $x$ gives

$$(1 + c_1 x^2 + c_2 x^3 + \cdots) \tan(c_1 x + c_2 x^2 + \cdots) - x = 0.$$

Using Maple's series command, we can find the power series expansion
of the left-hand side up to terms of order $x^9$. Maple produces the coefficients
explicitly as functions of $c_1, c_2, \ldots, c_9$. Setting these functions equal
to zero and solving successively for the unknowns $c_1, c_2, \ldots, c_9$ yields,
after a little bit of algebra, that $c_2 = c_4 = c_6 = \cdots = 0$ and

$$c_1 = 1, \quad c_3 = -\frac{4}{3}, \quad c_5 = \frac{53}{15}, \quad c_7 = -\frac{1226}{105}, \quad c_9 = \frac{13597}{315}.$$

Thus,

$$\alpha_n = \pi n + (\pi n)^{-1} - \frac{4}{3}(\pi n)^{-3} + \frac{53}{15}(\pi n)^{-5} - \frac{1226}{105}(\pi n)^{-7} + \frac{13597}{315}(\pi n)^{-9} + O\left((\pi n)^{-11}\right).$$
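The monotone convergence claimed in (b) can be observed directly. The following is a minimal Python sketch (Python rather than Matlab; the function name `newton_xtanx` and the fixed step count are illustrative assumptions). It runs Newton's method on $f(x) = x\tan x - 1$ from $x_0 = (n + \frac14)\pi$, using the formula for $f'$ given in (a), and prints the iterates for $n = 1$.

```python
import math

def newton_xtanx(n, steps=6):
    """Newton iterates for f(x) = x*tan(x) - 1 started at (n + 1/4)*pi."""
    x = (n + 0.25) * math.pi
    xs = [x]
    for _ in range(steps):
        t = math.tan(x)
        f = x * t - 1.0
        fprime = t + x * (1.0 + t * t)   # f'(x) = tan x + x(1 + tan^2 x)
        x = x - f / fprime
        xs.append(x)
    return xs

xs = newton_xtanx(1)
for x in xs:
    print(f"{x:.15f}")
```

The early iterates decrease strictly toward the root. Once the iteration has converged to working precision, later iterates may differ in the last bit, so strict monotonicity should only be expected before that point.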
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(d) 


PROGRAMS

%EXIV_32D
%
f0='%5.0f %19.15f %19.15f %10.2e\n';
disp(['    n       alpha_n       ' ...
   '    alpha_n_approx      error'])
for n=1:10
  a0=(n+1/4)*pi;
  a=fzero('equ32',a0);
  x=1/(pi*n);
  an=1/x+x-4*x^3/3+53*x^5/15-1226*x^7/105 ...
     +13597*x^9/315;
  err=an-a;
  fprintf(f0,n,a,an,err)
end

%EQU32
%
% The equation of EXIV_32
%
function y=equ32(x)
y=x*tan(x)-1;

OUTPUT

>> EXIV_32D
    n       alpha_n           alpha_n_approx      error
    1   3.425618459481728   3.426028729631524   4.10e-04
    2   6.437298179171947   6.437298435880711   2.57e-07
    3   9.529334405361963   9.529334408494419   3.13e-09
    4  12.645287223856643  12.645287223991568   1.35e-10
    5  15.771284874815885  15.771284874827579   1.17e-11
    6  18.902409956860023  18.902409956861607   1.58e-12
    7  22.036496727938566  22.036496727938854   2.88e-13
    8  25.172446326646664  25.172446326646728   6.39e-14
    9  28.309642854452012  28.309642854452026   1.42e-14
   10  31.447714637546234  31.447714637546238   3.55e-15
>>

34. (a) We have $f(0) = ab^n$, $f(\infty) = \infty$, and

$$f'(x) = (n + 1)x^n - b^n, \quad f''(x) = n(n + 1)x^{n-1} > 0 \quad \text{for } x > 0.$$

Thus, $f'(0) < 0$ and $f$ is convex on $[0, \infty]$, so that $f$ has a unique
minimum, say at $x = \xi$. Then there are two distinct positive roots precisely
when $f(\xi) < 0$. Since

$$\xi = \frac{b}{(n + 1)^{\frac{1}{n}}}$$

and

$$f(\xi) = \frac{b^{n+1}}{(n + 1)^{1 + \frac{1}{n}}} - \frac{b^{n+1}}{(n + 1)^{\frac{1}{n}}} + ab^n = b^n \left(a - \frac{nb}{(n + 1)^{1 + \frac{1}{n}}}\right),$$

the condition amounts to

$$a < \frac{nb}{(n + 1)^{1 + \frac{1}{n}}}.$$
(b) At the point $x = a$, we have

$$f(a) = a^{n+1} - b^n a + ab^n = a^{n+1} > 0,$$

$$f'(a) = (n + 1)a^n - b^n < (n + 1)\,\frac{n^n b^n}{(n + 1)^{n+1}} - b^n = b^n \left[\left(\frac{n}{n + 1}\right)^n - 1\right] < 0,$$

where on the second line the condition in (a) has been used. Since by
assumption $a < \frac{n}{n+1}\,\xi < \xi$, this means that $a$ must be to the left of the
smaller root, $\alpha_1$. By convexity of $f$, Newton's method started at $x = a$
therefore converges monotonically increasing to $\alpha_1$. Similarly,

$$f(b) = b^{n+1} - b^{n+1} + ab^n = ab^n > 0,$$
$$f'(b) = (n + 1)b^n - b^n = nb^n > 0,$$

and $b = (n + 1)^{\frac{1}{n}}\,\xi > \xi$, implying that $b > \alpha_2$, where $\alpha_2$ is the larger
root. By convexity of $f$, Newton's method started at $x_0 = b$ then converges
monotonically decreasing to $\alpha_2$.
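The two convergence statements can be illustrated on a concrete instance. The following is a minimal Python sketch (the function name `gregory_newton`, the sample data $n = 2$, $a = 1$, $b = 3$, and the stopping rule are illustrative assumptions). The data satisfy the condition in (a), since $nb/(n+1)^{1+1/n} = 6/3^{3/2} \approx 1.1547 > 1$, and the equation becomes $x^3 - 9x + 9 = 0$.

```python
def gregory_newton(n, a, b, x0, max_steps=50, tol=1e-15):
    """Newton's method for f(x) = x^(n+1) - b^n x + a b^n, started at x0."""
    f = lambda x: x ** (n + 1) - b ** n * x + a * b ** n
    fprime = lambda x: (n + 1) * x ** n - b ** n
    x = x0
    for _ in range(max_steps):
        dx = f(x) / fprime(x)
        x -= dx
        if abs(dx) <= tol * abs(x):   # assumed relative stopping test
            break
    return x

n, a, b = 2, 1.0, 3.0                 # sample data satisfying the condition in (a)
small = gregory_newton(n, a, b, x0=a) # start at a: approaches the smaller root
large = gregory_newton(n, a, b, x0=b) # start at b: approaches the larger root
print(small, large)
```

Both starting points lie outside the interval between the two roots, which is what makes the convexity argument of the solution applicable.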

40. (a) Convergence to $\frac{1}{a}$ requires that

$$\frac{1}{a} = \varphi\left(\frac{1}{a}\right) = \frac{A}{a} + \frac{B}{a^2} + \frac{C}{a^3},$$

that is,

$$a^2 A + aB + C = a^2.$$

Cubic convergence requires, in addition, $\varphi'\left(\frac{1}{a}\right) = \varphi''\left(\frac{1}{a}\right) = 0$, that is,

$$a^2 A + 2aB + 3C = 0,$$

$$2aB + 6C = 0.$$

We thus have three linear equations in the three unknowns $A$, $B$, $C$. Solving
them yields

$$A = 3, \quad B = -3a, \quad C = a^2,$$

giving

$$\varphi(x) = 3x - 3ax^2 + a^2x^3.$$

(b) Letting

$$\varepsilon_n = x_n - \frac{1}{a},$$

we have from the general theory of fixed point iterations (cf. (4.71)) that

$$\varepsilon_{n+1} = \frac{1}{3!}\,\varphi'''\,\varepsilon_n^3$$

(since $\varphi'''$ is constant), that is,

$$\varepsilon_{n+1} = a^2 \varepsilon_n^3, \quad n = 0, 1, 2, \ldots.$$

An easy inductive argument shows that

$$\varepsilon_n = \frac{1}{a}\,(a\varepsilon_0)^{3^n}.$$

Thus, we have convergence precisely if $a|\varepsilon_0| < 1$.
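The error formula $\varepsilon_n = \frac{1}{a}(a\varepsilon_0)^{3^n}$ can be compared with actual iterates. The following is a minimal Python sketch (the function name `reciprocal_cubic` and the sample values $a = 0.75$, $x_0 = 1.5$ are illustrative assumptions). It evaluates $\varphi(x) = 3x - 3ax^2 + a^2x^3$ in nested form, using only addition, subtraction, and multiplication.

```python
def reciprocal_cubic(a, x0, steps):
    """Iterates of x_{n+1} = 3x_n - 3a x_n^2 + a^2 x_n^3, written in nested form."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        ax = a * x
        xs.append(x * (3.0 - ax * (3.0 - ax)))
    return xs

a, x0 = 0.75, 1.5                      # sample values with a*|eps_0| = 0.125 < 1
xs = reciprocal_cubic(a, x0, steps=3)
eps0 = x0 - 1.0 / a
for n, x in enumerate(xs):
    predicted = (a * eps0) ** (3 ** n) / a   # eps_n from the inductive formula
    print(n, x - 1.0 / a, predicted)
```

After three steps the predicted error is far below double-precision resolution, so the printed error at that point reflects rounding only.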
43. (a) A hyperbola with the lines $x = X_0$ and $y = Y_0$ as asymptotes has the
equation

$$(x - X_0)(y - Y_0) - \tfrac12 a^2 = 0.$$

It intersects the x-axis at

$$x = X_0 - \frac{\frac12 a^2}{Y_0}.$$

The osculatory requirement of the hyperbola at the point $(x_n, f_n)$ (where
$f_n = f(x_n)$, etc.) is expressed by the following three equations:

$$(x_n - X_0)(f_n - Y_0) - \tfrac12 a^2 = 0,$$
$$f_n - Y_0 + (x_n - X_0) f_n' = 0,$$
$$2f_n' + (x_n - X_0) f_n'' = 0.$$

The unknowns are $X_0$, $Y_0$ and $a$; the first is obtained immediately from the
last equation,

$$X_0 = x_n + 2\,\frac{f_n'}{f_n''},$$

then the second from the second equation,

$$Y_0 = f_n - 2\,\frac{f_n'^2}{f_n''},$$

and the third from the first equation,

$$\tfrac12 a^2 = -4\,\frac{f_n'^3}{f_n''^2}.$$

Therefore,

$$x_{n+1} = X_0 - \frac{\frac12 a^2}{Y_0} = x_n + 2\,\frac{f_n'}{f_n''} + \frac{4 f_n'^3}{f_n''\,(f_n f_n'' - 2f_n'^2)} = x_n + \frac{2 f_n f_n'}{f_n f_n'' - 2 f_n'^2} = x_n - \frac{f_n}{f_n' - \frac12 f_n''\,\frac{f_n}{f_n'}},$$

as claimed.
(b) Newton's method applied to $g = 0$ is (in obvious notations)

$$x_{n+1} = x_n - \frac{g_n}{g_n'} = x_n - \frac{f_n/\sqrt{f_n'}}{\sqrt{f_n'} - \frac12 f_n f_n''/(f_n')^{3/2}} = x_n - \frac{f_n}{f_n' - \frac12 f_n''\,\frac{f_n}{f_n'}}.$$
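(c) We have

$$x_{n+1} = \varphi(x_n), \quad \text{where } \varphi(x) = x - \frac{f(x)}{f'(x) - \frac12 f''(x)\,\frac{f(x)}{f'(x)}}.$$

Thus (dropping the argument $x$ throughout),

$$\left(f'^2 - \tfrac12 f''f\right)\varphi = x\left(f'^2 - \tfrac12 f''f\right) - ff'.$$

Differentiating with respect to $x$ gives

$$\left(\tfrac32 f'f'' - \tfrac12 f'''f\right)\varphi + \left(f'^2 - \tfrac12 f''f\right)\varphi' = f'^2 - \tfrac12 f''f + x\left(\tfrac32 f'f'' - \tfrac12 f'''f\right) - f'^2 - ff''$$
$$= -\tfrac32 ff'' + x\left(\tfrac32 f'f'' - \tfrac12 f'''f\right).$$

Since $\varphi(\alpha) = \alpha$, putting $x = \alpha$, one sees that the first term on the left and
the last term on the far right cancel each other. What remains simplifies, in
view of $f(\alpha) = 0$, to

$$[f'(\alpha)]^2 \varphi'(\alpha) = 0,$$

hence, since $f'(\alpha) \ne 0$, to $\varphi'(\alpha) = 0$. Differentiating again, we get

$$\left(\tfrac32 f''^2 + f'f''' - \tfrac12 f''''f\right)\varphi + \left(3f'f'' - f'''f\right)\varphi' + \left(f'^2 - \tfrac12 f''f\right)\varphi''$$
$$= -2f'''f + x\left(\tfrac32 f''^2 + f'f''' - \tfrac12 f''''f\right).$$

Again putting $x = \alpha$ and using $\varphi(\alpha) = \alpha$, $\varphi'(\alpha) = 0$, $f(\alpha) = 0$, we obtain

$$\left(\tfrac32 [f''(\alpha)]^2 + f'(\alpha)f'''(\alpha)\right)\alpha + [f'(\alpha)]^2 \varphi''(\alpha) = \left(\tfrac32 [f''(\alpha)]^2 + f'(\alpha)f'''(\alpha)\right)\alpha,$$

that is, $[f'(\alpha)]^2 \varphi''(\alpha) = 0$, hence $\varphi''(\alpha) = 0$. Thus, the order of
convergence is at least $p = 3$. To obtain $\varphi'''(\alpha)$, we must differentiate once
more and get, when $x = \alpha$,

$$\left(4f''f''' + \tfrac12 f'f''''\right)\Big|_{x=\alpha}\,\alpha + [f'(\alpha)]^2 \varphi'''(\alpha) = \left(-f'f''' + \tfrac32 f''^2\right)\Big|_{x=\alpha} + \left(4f''f''' + \tfrac12 f'f''''\right)\Big|_{x=\alpha}\,\alpha,$$

hence

$$\varphi'''(\alpha) = -\left[\frac{f'''}{f'} - \frac32 \left(\frac{f''}{f'}\right)^2\right]_{x=\alpha} = -(Sf)(\alpha).$$

Thus, if $(Sf)(\alpha) \ne 0$, the order of convergence is exactly equal to 3.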

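(d) Yes, slightly. Recall from Sect. 4.2 that the efficiency index is $p^{1/m}$, where
$p$ is the order of the method and $m$ the number of "units of work." For
Newton's method this is $\sqrt{2} = 1.4142\ldots$, whereas for Halley's method it
is $\sqrt[3]{3} = 1.4422\ldots$.

(e) In the case $f(x) = x^\lambda - a$, we get

$$f'(x) = \lambda x^{\lambda - 1}, \quad f''(x) = \lambda(\lambda - 1)x^{\lambda - 2},$$

hence

$$x_{n+1} = x_n - \frac{x_n^\lambda - a}{\lambda x_n^{\lambda - 1} - \frac12(\lambda - 1)\,x_n^{-1}(x_n^\lambda - a)}$$

$$= x_n\,\frac{\lambda x_n^\lambda - \frac12(\lambda - 1)(x_n^\lambda - a) - x_n^\lambda + a}{\lambda x_n^\lambda - \frac12(\lambda - 1)(x_n^\lambda - a)}$$

$$= x_n\,\frac{(\lambda - 1)x_n^\lambda + (\lambda + 1)a}{(\lambda + 1)x_n^\lambda + (\lambda - 1)a}.$$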
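The closed form obtained in (e) is simple to try out. The following is a minimal Python sketch (the function name `halley_root` and the sample inputs are illustrative assumptions). It iterates $x_{n+1} = x_n \frac{(\lambda-1)x_n^\lambda + (\lambda+1)a}{(\lambda+1)x_n^\lambda + (\lambda-1)a}$; for $\lambda = 2$ this is the iteration of Ex. 39.

```python
import math

def halley_root(a, lam, x0, steps):
    """Halley iterates for f(x) = x^lam - a, using the closed form of 43(e)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xl = x ** lam
        xs.append(x * ((lam - 1) * xl + (lam + 1) * a)
                    / ((lam + 1) * xl + (lam - 1) * a))
    return xs

sqrt_iterates = halley_root(2.0, 2, x0=1.0, steps=3)   # approximates sqrt(2)
cbrt_iterates = halley_root(8.0, 3, x0=1.0, steps=5)   # approximates 8^(1/3) = 2
print(sqrt_iterates)
print(cbrt_iterates)
```

With $\lambda = 2$, $a = 2$, $x_0 = 1$ the first step gives $7/5 = 1.4$, which can be checked by hand against the formula of Ex. 39.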


45.(a) By Newton’s interpolation formula with remainder term, interpolating the 
nth-degree polynomial p by a polynomial of degree n — 1, we can write 


ioe NOT ewe ao Tt» 


v=1 pw=1 
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(b 


wm 


Since p is monic of degree n, we have p(t) = n!, and so 


n v-l n 
pO) = >) AG) [ [@-x.) + []e¢-x,). 
v=1 p= pw=1 


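Suppose $\alpha^T = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ is a solution of $f(x) = 0$. Then $f_\nu(\alpha) = 0$
for $\nu = 1, 2, \ldots, n$, hence

\[ p(t) = \prod_{\mu=1}^{n} (t - \alpha_\mu), \]

showing that $\alpha_\mu$ are the zeros of $p$, which are unique up to permutation.
Conversely, let $\alpha^T = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ be the zeros of $p$ in some order. Then,
identically in $t$,

\[ p(t) = \sum_{\nu=1}^{n} f_\nu(\alpha) \prod_{\mu=1}^{\nu-1} (t - \alpha_\mu) + \prod_{\mu=1}^{n} (t - \alpha_\mu), \]

that is, since $\prod_{\mu=1}^{n} (t - \alpha_\mu) = p(t)$,

\[ 0 = \sum_{\nu=1}^{n} f_\nu(\alpha) \prod_{\mu=1}^{\nu-1} (t - \alpha_\mu). \]

Letting $t \to \alpha_1$ gives $f_1(\alpha) = 0$. Dividing the remaining equation by $t - \alpha_1$
and letting $t \to \alpha_2$ gives $f_2(\alpha) = 0$. Continuing in this manner yields
$f_3(\alpha) = 0, \ldots, f_n(\alpha) = 0$ in this order.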


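The Jacobian of $f$ is a lower triangular matrix, namely

\[ \begin{bmatrix}
[x_1, x_1]p & 0 & \cdots & 0 \\
[x_1, x_2, x_1]p & [x_1, x_2, x_2]p & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
[x_1, x_2, \ldots, x_n, x_1]p & [x_1, x_2, \ldots, x_n, x_2]p & \cdots & [x_1, x_2, \ldots, x_n, x_n]p
\end{bmatrix} \]

(cf. Ch. 2, Ex. 58). Given an approximation $\alpha^{[i]}$ to the root vector $\alpha$,
Newton’s method requires the solution by forward substitution of the lower
triangular system

\[ \frac{\partial f}{\partial x}(\alpha^{[i]})\, \Delta^{[i]} = -f(\alpha^{[i]}) \]

to get the next approximation vector

\[ \alpha^{[i+1]} = \alpha^{[i]} + \Delta^{[i]}. \]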
(c) The arguments in (a), (b) extend to functions $p$ that are sufficiently
smooth, for example, $p \in C^n[\mathbb{R}]$. Then, if $f_\nu(\alpha) = 0$ for some $\alpha^T =
[\alpha_1, \alpha_2, \ldots, \alpha_n]$, the $\alpha_1, \alpha_2, \ldots, \alpha_n$ are zeros of $p$ and vice versa. The first
statement is an immediate consequence of the first identity above in (a). The
converse follows similarly as before by taking limits $t \to \alpha_\mu$, followed by
division of both sides of the identity by $t - \alpha_\mu$.
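6.(a) The discretization of the boundary value problem amounts to solving the
system of nonlinear equations

\[ u_{k+1} - 2u_k + u_{k-1} = h^2 g\left(x_k, u_k, \frac{u_{k+1} - u_{k-1}}{2h}\right), \quad k = 1, 2, \ldots, n, \]
\[ u_0 = y_0, \quad u_{n+1} = y_1, \]

where $h = 1/(n+1)$. With

\[ A = \begin{bmatrix} -2 & 1 & & 0 \\ 1 & -2 & \ddots & \\ & \ddots & \ddots & 1 \\ 0 & & 1 & -2 \end{bmatrix}, \quad
   u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}, \quad
   F(u) = \begin{bmatrix} g\left(x_1, u_1, \frac{u_2 - u_0}{2h}\right) \\ g\left(x_2, u_2, \frac{u_3 - u_1}{2h}\right) \\ \vdots \\ g\left(x_n, u_n, \frac{u_{n+1} - u_{n-1}}{2h}\right) \end{bmatrix}, \]

this can be written as a fixed point problem

\[ Au = h^2 F(u) - b \quad \text{or} \quad u = A^{-1}(h^2 F(u) - b), \]

where

\[ b = y_0 e_1 + y_1 e_n, \quad e_1 = [1, 0, \ldots, 0]^T \in \mathbb{R}^n, \quad e_n = [0, 0, \ldots, 1]^T \in \mathbb{R}^n. \]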
The corresponding fixed point iteration is

\[ A u^{[i+1]} = h^2 F(u^{[i]}) - b, \quad i = 0, 1, 2, \ldots; \qquad u^{[0]} = \text{initial approximation}. \]
(b) PROGRAM


%MAIV_6B
f0='%8.0f %4.0f %12.4e %12.4e\n';
disp('       n   it     err        order')
eps0=.5e-14;
for n=[10 100 1000 10000]
  h=1/(n+1); h2=h^2;
  a=-2*ones(n,1); b=ones(n-1,1); c=b;
  en=eye(n); en=en(:,n);
  x=linspace(h,1-h,n)'; y=sinh(x)/sinh(1);
  it=0; u0=zeros(n,1); u1=ones(n,1);
  while max(abs(u1-u0))>eps0
    it=it+1;
    u0=u1;
    u1=tridiag(n,a,b,c,h2*u0-en);
  end
  err=max(abs(u1-y)); ord=err/h2;
  fprintf(f0,n,it,err,ord)
end


OUTPUT 


>> MAIV_6B 


n it err order 
10 16 3.6185e-05 4.3784e-03 
100 16 4.3345e-07 4.4217e-03 
1000 16 4.4131e-09 4.4219e-03 
10000 16 5.0103e-11 5.0113e-03 


>> 


Since the central difference approximations of the derivatives have errors
of $O(h^2)$, the same can be expected for the errors in the solution. This is
confirmed in the last column of the OUTPUT, suggesting also that the constant
involved is about $5 \times 10^{-3}$.

(c) The system of nonlinear equations is

\[ Au = -h^2 \sin u - e_n, \quad h = \frac{\pi}{4(n+1)}, \]

where $\sin u = [\sin u_1, \sin u_2, \ldots, \sin u_n]^T$. The iteration function is

\[ \varphi(u) = -A^{-1}(h^2 \sin u + e_n). \]

We have, using the Mean Value Theorem applied to the sine function,

\[ \varphi(u) - \varphi(u^*) = -h^2 A^{-1}(\sin u - \sin u^*) = -h^2 A^{-1} \mathrm{diag}(\cos(\eta))(u - u^*), \]

where the elements of $\eta$ are on the line segment between $u$ and $u^*$. Therefore,

\[ \|\varphi(u) - \varphi(u^*)\|_\infty \le h^2 \|A^{-1}\|_\infty \|u - u^*\|_\infty. \]

Here we have, using the Hint,

\[ h^2 \|A^{-1}\|_\infty \le h^2\, \frac{(n+1)^2}{8} = \frac{\pi^2}{128} = 0.077106\ldots < 1, \]

so that $\varphi$ is strongly contractive and we have rapid convergence of the fixed
point iteration (cf. Theorem 4.9.1).


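8.(a) Let

\[ f(\alpha) = \int_0^{3\pi/2} \frac{\cos t}{t^\alpha}\, dt, \quad 0 < \alpha < 1. \]

Clearly, $f(0) = \sin(3\pi/2) = -1$ and $f(1) = \infty$. A graph of the function
reveals that $f$ is monotonically increasing and convex, and thus has a unique
zero in (0, 1); it is located near 0.3. Any initial approximation to the right of
this zero will guarantee convergence of Newton’s method, owing to convexity.
We choose 0.4 as initial approximation.

To transform to a standard interval, we make the change of variables
$t = \frac{3\pi}{2} x$ and have

\[ f(\alpha) = \left(\frac{3\pi}{2}\right)^{1-\alpha} \int_0^1 \cos\left(\frac{3\pi}{2} x\right) x^{-\alpha}\, dx. \]

Differentiating, we get

\[ f'(\alpha) = \left(\frac{3\pi}{2}\right)^{1-\alpha} \left[ \int_0^1 \cos\left(\frac{3\pi}{2} x\right) x^{-\alpha} \ln(1/x)\, dx
   - \ln\left(\frac{3\pi}{2}\right) \int_0^1 \cos\left(\frac{3\pi}{2} x\right) x^{-\alpha}\, dx \right], \]

and Newton’s method becomes

\[ \alpha^{[i+1]} = \alpha^{[i]} - \frac{f(\alpha^{[i]})}{f'(\alpha^{[i]})}. \]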
The integral defining $f$, and the second integral in the formula for $f'$, call for
Gauss–Jacobi quadrature with Jacobi parameters 0 and $-\alpha$. The first integral
in $f'$ requires Gauss quadrature relative to the weight function $x^{-\alpha} \ln(1/x)$
on [0,1]. The OPQ routines providing the recurrence coefficients for the
respective orthogonal polynomials are r_jacobi01.m and r_jaclog.m,
respectively, while the routine gauss.m generates the Gaussian quadrature
formulae from the recurrence coefficients. The program below also determines,
and prints, the respective numbers $n$ and $n_0$ of quadrature points
required for 14 decimal digit accuracy:


PROGRAM


%MAIV_8A
%
nmax=50; eps0=.5e-14;
a1=.4; a=0;
while abs(a1-a)>eps0
  a=a1;
  ab=r_jacobi01(nmax,0,-a);
  ab0=r_jaclog(nmax,-a);
  n=1; sg1=0; sg=1;
  while abs(sg1-sg)>eps0
    n=n+1; sg=sg1;
    xw=gauss(n,ab);
    sg1=sum(xw(:,2).*cos(3*pi*xw(:,1)/2));
  end
  f=(3*pi/2)^(1-a)*sg1;
  n0=1; sg01=0; sg0=1;
  while abs(sg01-sg0)>eps0
    n0=n0+1;
    sg0=sg01;
    xw0=gauss(n0,ab0);
    sg01=sum(xw0(:,2).*cos(3*pi*xw0(:,1)/2));
  end
  fd=(3*pi/2)^(1-a)*(sg01-log(3*pi/2)*sg1);
  a1=a-f/fd;
end
fprintf('   n=%2.0f, n0=%2.0f, alpha0=%17.15f\n',n,n0,a1)

OUTPUT 


>> MAIV_8A 
n=10, n0=10, alpha0=0.308443779561986 
A high-precision value of $\alpha_0$ (to 1,120 digits) can be found in Arias de Reyna
and Van de Lune [2009, Sect. 5].


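(b) Letting

\[ g(\alpha) = \int_0^{5\pi/4} \frac{\cos(t + \pi/4)}{t^\alpha}\, dt, \quad 0 < \alpha < 1, \]

a similar calculation as in (a) yields

\[ g(\alpha) = \left(\frac{5\pi}{4}\right)^{1-\alpha} \int_0^1 \frac{\cos((5x+1)\pi/4)}{x^\alpha}\, dx, \]

\[ g'(\alpha) = \left(\frac{5\pi}{4}\right)^{1-\alpha} \left[ \int_0^1 \cos((5x+1)\pi/4)\, x^{-\alpha} \ln(1/x)\, dx
   - \ln\left(\frac{5\pi}{4}\right) \int_0^1 \frac{\cos((5x+1)\pi/4)}{x^\alpha}\, dx \right]. \]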


PROGRAM


%MAIV_8B
%
nmax=50; eps0=.5e-14;
a1=.7; a=0;
while abs(a1-a)>eps0
  a=a1;
  ab=r_jacobi01(nmax,0,-a);
  ab0=r_jaclog(nmax,-a);
  n=1; sg1=0; sg=1;
  while abs(sg1-sg)>eps0
    n=n+1; sg=sg1;
    xw=gauss(n,ab);
    sg1=sum(xw(:,2).*cos(pi*(5*xw(:,1)+1)/4));
  end
  g=(5*pi/4)^(1-a)*sg1;
  n0=1; sg01=0; sg0=1;
  while abs(sg01-sg0)>eps0
    n0=n0+1;
    sg0=sg01;
    xw0=gauss(n0,ab0);
    sg01=sum(xw0(:,2).*cos(pi*(5*xw0(:,1)+1)/4));
  end
  gd=(5*pi/4)^(1-a)*(sg01-log(5*pi/4)*sg1);
  a1=a-g/gd;
end
fprintf('   n=%2.0f, n0=%2.0f, alpha1=%17.15f\n',n,n0,a1)

OUTPUT


>> MAIV_8B
n=10, n0=10, alpha1=0.614433447526100


Chapter 5 
Initial Value Problems for ODEs: 
One-Step Methods 


Initial value problems for ordinary differential equations (ODEs) occur in almost 
all the sciences, notably in mechanics (including celestial mechanics), where the 
motion of particles (resp., planets) is governed by Newton’s second law — a system 
of second-order differential equations. It is no wonder, therefore, that astronomers 
such as Adams, Moulton, and Cowell were instrumental in developing numerical 
techniques for their integration.¹ Also in quantum mechanics (the analogue of
celestial mechanics in the realm of atomic dimensions) differential equations are
fundamental; this time it is Schrödinger’s equation that reigns, actually a linear
partial differential equation involving the Laplace operator. Still, when separated in
polar coordinates, it reduces to an ordinary second-order linear differential equation.
Such equations are at the heart of the theory of special functions of mathematical
physics. Coulomb wave functions, for example, are solutions of Schrödinger’s
equation when the potential is a Coulomb field.

Within mathematics, ordinary differential equations play an important role in the 
calculus of variations, where optimal trajectories must satisfy the Euler equations, or 
in optimal control problems, where they satisfy the Pontryagin maximum principle. 
In both cases, one is led to boundary value problems for ordinary differential 
equations. This type of problem is discussed in Chap. 7. In the present and next 
chapter, we concentrate on initial value problems. 

We begin with some examples of ordinary differential equations as they arise in 
the context of numerical analysis. 


¹In fact, it was by means of computational methods that Le Verrier in 1846 predicted the existence
of the eighth planet Neptune, based on observed (and unaccounted for) irregularities in the orbit of 
the next inner planet. Soon thereafter, the planet was indeed discovered at precisely the predicted 
location. (Some calculations were done previously by Adams, then an undergraduate at Cambridge, 
but were not published in time.) 
5.1 Examples


Our first example is rather trivial: suppose we want to compute the integral

\[ I = \int_a^b f(t)\, dt \tag{5.1} \]

for some given function $f$ on $[a, b]$. Letting $y(x) = \int_a^x f(t)\, dt$, we get immediately

\[ \frac{dy}{dx} = f(x), \quad y(a) = 0. \tag{5.2} \]


This is an initial value problem for a first-order differential equation, but a very 
special one in which y does not appear on the right-hand side. By solving (5.2) on 
[a,b], one obtains $I = y(b)$. This example, as elementary as it seems to be, is not
entirely without interest, because when we integrate (5.2) by modern techniques of 
solving ODEs, we can take advantage of step control mechanisms, taking smaller 
steps where f changes more rapidly, and larger ones elsewhere. This gives rise to 
what may be called adaptive integration. 
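The idea of adaptive integration can be sketched with a standard ODE solver. In the following minimal example the integrand, the tolerances, and the use of `ode45` are illustrative assumptions, not taken from the text; the solver's step control should concentrate the steps near the peak of $f$.

```matlab
% Compute I = int_a^b f(t) dt by solving y' = f(x), y(a) = 0, cf. (5.2).
% The integrand and the tolerances are illustrative choices.
f = @(t) exp(-50*(t-0.5).^2);            % sharply peaked at t = 0.5
a = 0; b = 1;
opts = odeset('RelTol',1e-10,'AbsTol',1e-12,'Refine',1);
[x, y] = ode45(@(x,y) f(x), [a b], 0, opts);
I = y(end);                              % I = y(b)
hsteps = diff(x);                        % step sizes chosen by the solver
fprintf('I = %.10f with %d steps, min/max step = %.2e / %.2e\n', ...
        I, numel(hsteps), min(hsteps), max(hsteps))
```

For this integrand the result can be compared with the closed form `sqrt(pi/50)*erf(sqrt(50)/2)`.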

A more interesting extension of this idea is exemplified by the following
integral,

\[ I = \int_0^\infty J_0(t^2) e^{-t}\, dt, \tag{5.3} \]

containing the Bessel function $J_0(x^2)$, one of the special functions of mathematical
physics. It satisfies the linear second-order differential equation

\[ \frac{d^2 y}{dx^2} + \left(2 - \frac{1}{x}\right) \frac{dy}{dx} + 4x^2 y = 0 \tag{5.4} \]

with initial conditions

\[ y(0) = 1, \quad y'(0) = 0. \tag{5.5} \]

As is often the case with special functions arising in physics, the associated
differential equation has a singularity at the origin $x = 0$, even though the solution
$y(x) = J_0(x^2)$ is perfectly regular at $x = 0$. (The singular term in (5.4) has limit 0
at $x = 0$.) Here, we let

\[ y_1(x) = \int_0^x J_0(t^2) e^{-t}\, dt, \quad y_2(x) = J_0(x^2), \quad y_3(x) = y_2'(x) \tag{5.6} \]
and obtain

\[ \frac{dy_1}{dx} = e^{-x} y_2, \quad y_1(0) = 0, \]
\[ \frac{dy_2}{dx} = y_3, \quad y_2(0) = 1, \]
\[ \frac{dy_3}{dx} = -\left(2 - \frac{1}{x}\right) y_3 - 4x^2 y_2, \quad y_3(0) = 0, \tag{5.7} \]

an initial value problem for a system of three (linear) first-order differential
equations. We need to integrate this on $(0, \infty)$ to find $I = y_1(\infty)$. An advantage of this
approach is that the Bessel function need not be calculated explicitly; it is calculated
implicitly (as component $y_2$) through the integration of the differential equation
that it satisfies. Another advantage over straightforward numerical integration is
again the possibility of automatic error control through appropriate step change
techniques. We could equally well get rid of the exponential $e^{-x}$ in (5.7) by calling
it $y_4$ and adding the differential equation $dy_4/dx = -y_4$ and initial condition
$y_4(0) = 1$. (The difficulty with the singularity at $x = 0$ can be handled, e.g., by
introducing initially, say, for $0 \le x \le \frac13$, a new dependent variable $\tilde{y}_3$, setting
$\tilde{y}_3 = \left(2 - \frac1x\right) y_3$. Then $\tilde{y}_3(0) = 0$, and the second and third equations
in (5.7) can be replaced by

\[ \frac{dy_2}{dx} = -\frac{x}{1 - 2x}\, \tilde{y}_3, \]
\[ \frac{d\tilde{y}_3}{dx} = -\frac{4(1 - x)}{1 - 2x}\, \tilde{y}_3 + 4x(1 - 2x)\, y_2. \]

Here, the coefficients are now well-behaved functions near $x = 0$. Once $x = \frac13$
has been reached, one can switch back to (5.7), using for $y_3$ the initial condition
$y_3(\frac13) = -\tilde{y}_3(\frac13)$.)
Another example is the method of lines in partial differential equations, where
one discretizes partial derivatives with respect to all variables but one, thereby
obtaining a system of ordinary differential equations. This may be illustrated by
the heat equation on a rectangular domain,

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}, \quad 0 \le x \le 1, \quad 0 \le t \le T, \tag{5.8} \]

where the temperature $u = u(x, t)$ is to satisfy the initial condition

\[ u(x, 0) = \varphi(x), \quad 0 \le x \le 1, \tag{5.9} \]

and the boundary conditions

\[ u(0, t) = \lambda(t), \quad u(1, t) = \rho(t), \quad 0 \le t \le T. \tag{5.10} \]
Here, $\varphi$ is a given function of $x$, and $\lambda$, $\rho$ given functions of $t$. We discretize in the
$x$-variable by placing a grid $\{x_n\}_{n=0}^{N+1}$ with $x_n = nh$, $h = \frac{1}{N+1}$, on the interval
$0 \le x \le 1$ and approximating the second derivative by the second divided difference
(cf. Chap. 2, (2.117) for $n = 2$, putting $\xi = x_1$),

\[ \left. \frac{\partial^2 u}{\partial x^2} \right|_{x = x_n} \approx \frac{u_{n+1} - 2u_n + u_{n-1}}{h^2}, \quad n = 1, 2, \ldots, N, \tag{5.11} \]

where

\[ u_n = u_n(t) = u(x_n, t), \quad n = 0, 1, \ldots, N+1. \tag{5.12} \]

Writing down (5.8) for $x = x_n$, $n = 1, 2, \ldots, N$, and using (5.11), we get
approximately

\[ \frac{du_n}{dt} = \frac{1}{h^2} (u_{n+1} - 2u_n + u_{n-1}), \quad u_n(0) = \varphi(x_n), \quad n = 1, 2, \ldots, N, \tag{5.13} \]

an initial value problem for a system of $N$ differential equations in the $N$ unknown
functions $u_1, u_2, \ldots, u_N$. The boundary functions $\lambda(t)$, $\rho(t)$ enter into (5.13) when
reference is made to $u_0$ or $u_{N+1}$ on the right-hand side of the differential equation.
By making the grid finer and finer, hence $h$ smaller, one expects to obtain better and
better approximations for $u(x, t)$. Unfortunately, this comes at a price: the system
(5.13) becomes more and more “stiff” as $h$ decreases, calling for special methods
designed especially for stiff equations (cf. Sect. 5.9; Chap. 6, Sect. 6.5).
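The semi-discrete system (5.13) can be sketched in a few lines. The example below is only an illustration under stated assumptions: the initial function $\sin(\pi x)$, zero boundary values, the final time, and the use of Euler's method in time are choices made here, not taken from the text. It also hints at the stiffness issue, since the explicit time step has to shrink like $h^2$.

```matlab
% Method of lines for u_t = u_xx on 0 <= x <= 1, cf. (5.13), Euler in time.
% Illustrative data: u(x,0) = sin(pi*x) and zero boundary values, so that
% the exact solution is u(x,t) = exp(-pi^2*t)*sin(pi*x).
N = 19; h = 1/(N+1); x = (1:N)'*h;        % interior grid points x_n = n*h
e = ones(N,1);
A = spdiags([e -2*e e], -1:1, N, N)/h^2;  % second divided differences
u = sin(pi*x); T = 0.1;
dt = 0.4*h^2;                             % Euler is stable only for dt <= h^2/2
nsteps = ceil(T/dt); dt = T/nsteps;
for k = 1:nsteps
    u = u + dt*(A*u);                     % zero boundary values add nothing
end
err = max(abs(u - exp(-pi^2*T)*sin(pi*x)));
fprintf('N = %d, %d time steps, max error = %.2e\n', N, nsteps, err)
fprintf('largest eigenvalue of A in modulus is about 4/h^2 = %.0f\n', 4/h^2)
```

Halving $h$ roughly quadruples the number of time steps the explicit method needs, which is one way to see why stiff solvers are preferred for such systems.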


5.2 Types of Differential Equations 


The standard initial value problem involves a system of first-order differential
equations

\[ \frac{dy^i}{dx} = f^i(x, y^1, y^2, \ldots, y^d), \quad i = 1, 2, \ldots, d, \tag{5.14} \]

which is to be solved on an interval $[a, b]$, given the initial values

\[ y^i(a) = y_0^i, \quad i = 1, 2, \ldots, d. \tag{5.15} \]

Here, the component functions are indexed by superscripts, and subscripts are
reserved to indicate step numbers, the initial step having index 0. We use vector
notation throughout by letting

\[ y^T = [y^1, y^2, \ldots, y^d], \quad f^T = [f^1, f^2, \ldots, f^d], \quad y_0^T = [y_0^1, y_0^2, \ldots, y_0^d] \]

and writing (5.14), (5.15) in the form

\[ \frac{dy}{dx} = f(x, y), \quad a \le x \le b; \quad y(a) = y_0. \tag{5.16} \]

We are thus seeking a vector-valued function $y(x) \in C^1[a, b]$ that satisfies (5.16)
identically on $[a, b]$ and has the starting value $y_0$ at $x = a$. About $f(x, y)$, we
assume that it is defined for $x \in [a, b]$ and all $y \in \mathbb{R}^d$.²

We note some important special cases of (5.16). 


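1. $d = 1$, $y' = f(x, y)$: a single first-order differential equation.

2. $d > 1$, $u^{(d)} = g(x, u, u', \ldots, u^{(d-1)})$: a single $d$th-order differential equation.
The initial conditions here take the form $u^{(i)}(a) = u_0^i$, $i = 0, 1, \ldots, d-1$. This
problem is easily brought into the form (5.16) by defining

\[ y^i = u^{(i-1)}, \quad i = 1, 2, \ldots, d. \]

Then

\[ \frac{dy^1}{dx} = y^2, \quad y^1(a) = u_0^0, \]
\[ \frac{dy^2}{dx} = y^3, \quad y^2(a) = u_0^1, \]
\[ \cdots \]
\[ \frac{dy^{d-1}}{dx} = y^d, \quad y^{d-1}(a) = u_0^{d-2}, \]
\[ \frac{dy^d}{dx} = g(x, y^1, y^2, \ldots, y^d), \quad y^d(a) = u_0^{d-1}, \tag{5.17} \]

which has the form (5.16) with very special (linear) functions $f^1, f^2, \ldots, f^{d-1}$,
and $f^d(x, y) = g(x, y)$.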
Although this is the canonical way of transforming a single equation of order $d$
into a system of first-order equations, there are other ways of doing this, which are
sometimes more natural. Consider, for example, the Sturm–Liouville equation

\[ \frac{d}{dx}\left(p(x) \frac{du}{dx}\right) + q(x) u = 0, \quad a \le x \le b, \tag{5.18} \]

where $p(x) \neq 0$ on $[a, b]$. Here, the substitution

\[ y^1 = u, \quad y^2 = p(x) \frac{du}{dx} \tag{5.19} \]

is more appropriate (and also physically more meaningful), leading to the system

\[ \frac{dy^1}{dx} = \frac{1}{p(x)}\, y^2, \]
\[ \frac{dy^2}{dx} = -q(x)\, y^1. \tag{5.20} \]

Mathematically, (5.20) has the advantage over (5.18) of not requiring that $p$ be
differentiable on $[a, b]$.


²That is, each component function $f^i(x, y)$ is defined on $[a, b] \times \mathbb{R}^d$. In some problems, $f(x, y)$
is defined only on $[a, b] \times D$, where $D \subset \mathbb{R}^d$ is a compact domain. In such cases, the solution
$y(x)$ must be required to remain in $D$ as $x$ varies in $[a, b]$. This causes complications, which we
avoid by assuming $D = \mathbb{R}^d$.


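3. $dy/dx = f(y)$, $y \in \mathbb{R}^d$: an autonomous system of differential equations. Here,
$f$ does not depend explicitly on $x$. This can always be trivially achieved by
introducing, if necessary, $y^0(x) = x$ and writing (5.16) as

\[ \frac{d}{dx} \begin{bmatrix} y^0 \\ y \end{bmatrix} = \begin{bmatrix} 1 \\ f(y^0, y) \end{bmatrix}, \quad a \le x \le b; \quad
   \begin{bmatrix} y^0 \\ y \end{bmatrix}(a) = \begin{bmatrix} a \\ y_0 \end{bmatrix}. \tag{5.21} \]

Many ODE software packages indeed assume that the system is autonomous.

4. Second-order system

\[ \frac{d^2 u^i}{dx^2} = g^i\left(x, u^1, \ldots, u^d, \frac{du^1}{dx}, \ldots, \frac{du^d}{dx}\right), \quad i = 1, 2, \ldots, d. \tag{5.22} \]

Newton’s law in mechanics is of this form, with $d = 3$. The canonical
transformation here introduces

\[ y^1 = u^1, \ldots, y^d = u^d, \quad y^{d+1} = \frac{du^1}{dx}, \ldots, y^{2d} = \frac{du^d}{dx}, \]

and yields a system of $2d$ first-order equations,

\[ \frac{dy^1}{dx} = y^{d+1}, \]
\[ \cdots \]
\[ \frac{dy^d}{dx} = y^{2d}, \]
\[ \frac{dy^{d+1}}{dx} = g^1(x, y^1, \ldots, y^{2d}), \]
\[ \cdots \]
\[ \frac{dy^{2d}}{dx} = g^d(x, y^1, \ldots, y^{2d}). \tag{5.23} \]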
5. Implicit system of first-order differential equations:

\[ F^i\left(x, u, \frac{du}{dx}\right) = 0, \quad i = 1, 2, \ldots, d; \quad u \in \mathbb{R}^d. \tag{5.24} \]

Here, $F^i = F^i(x, u, v)$ are given functions of $2d + 1$ variables, which again we
combine into a vector $F = [F^i]$. We denote by $F_x$ the partial derivative of $F$
with respect to $x$, and by $F_u$, $F_v$ the Jacobian matrices

\[ F_u(x, u, v) = \left[\frac{\partial F^i}{\partial u^j}\right], \quad F_v(x, u, v) = \left[\frac{\partial F^i}{\partial v^j}\right]. \]

Assuming that these Jacobians exist and that $F_v$ is nonsingular on $[a, b] \times \mathbb{R}^d \times \mathbb{R}^d$,
the implicit system (5.24) may be dealt with by differentiating it totally with
respect to $x$. This yields (in vector notation)

\[ F_x + F_u \frac{du}{dx} + F_v \frac{d^2 u}{dx^2} = 0, \tag{5.25} \]

where the arguments in $F_x$, $F_u$, $F_v$ are $x$, $u$, $u' = du/dx$. By assumption, this
can be solved for the second derivative,

\[ \frac{d^2 u}{dx^2} = F_v^{-1}(x, u, u') \left\{ -F_x(x, u, u') - F_u(x, u, u') \frac{du}{dx} \right\}, \tag{5.26} \]

yielding an (explicit) system of second-order differential equations (cf. item 4). If
we are to solve the initial value problem for (5.24) on $[a, b]$, with $u(a) = u_0$
prescribed, we need for (5.26) the additional initial data $(du/dx)(a) = u_0'$. This
must be obtained by solving $F(a, u_0, u_0') = 0$ for $u_0'$, which in general is a
nonlinear system of equations (unless $F(x, u, v)$ depends linearly on $v$). But from
then on, when integrating (5.26) (or the equivalent first-order system), only linear
systems (5.25) need to be solved at each step to compute $d^2 u/dx^2$ for given
$x$, $u$, $u'$, since the numerical method automatically updates $x$, $u$, $u'$ from step to
step.
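The reductions in items 2 and 3 can be sketched concretely. The equation $u'' = -u$, its initial data, and the use of `ode45` below are illustrative assumptions, chosen only because the exact solution $\sin x$ is known.

```matlab
% Canonical reduction (5.17) of a single second-order equation to a system.
% Illustrative equation (not from the text): u'' = -u, u(0) = 0, u'(0) = 1,
% with exact solution u(x) = sin(x).
g = @(x,u,up) -u;                        % right-hand side of u'' = g(x,u,u')
f = @(x,y) [y(2); g(x,y(1),y(2))];       % y1 = u, y2 = u'  gives  y' = f(x,y)
y0 = [0; 1];
% Autonomous form (5.21): prepend the component y0(x) = x.
fa = @(z) [1; f(z(1), z(2:end))];        % z = [x; y]
z0 = [0; y0];
% Integrate the first-order system and compare with the exact solution.
[xs, ys] = ode45(f, [0 1], y0);
err = abs(ys(end,1) - sin(1));
sa = fa([0.3; sin(0.3); cos(0.3)]);      % slope vector of the autonomous form
fprintf('error in u(1): %.2e\n', err)
```

The autonomous right-hand side `fa` takes a single argument, which is the calling convention that many solvers for autonomous systems expect.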


5.3. Existence and Uniqueness 


We recall from the theory of differential equations the following basic existence and 
uniqueness theorem. 


Theorem 5.3.1. Assume that $f(x, y)$ is continuous in the first variable for $x \in
[a, b]$ and with respect to the second satisfies a uniform Lipschitz condition

\[ \|f(x, y) - f(x, y^*)\| \le L \|y - y^*\|, \quad x \in [a, b], \quad y, y^* \in \mathbb{R}^d, \tag{5.27} \]

where $\|\cdot\|$ is some vector norm. Then the initial value problem (5.16) has a unique
solution $y(x)$, $a \le x \le b$, for arbitrary $y_0 \in \mathbb{R}^d$. Moreover, $y(x)$ depends
continuously on $x_0$ and $y_0$.


The Lipschitz condition (5.27) certainly holds if all functions $\frac{\partial f^i}{\partial y^j}(x, y)$,
$i, j = 1, 2, \ldots, d$, are continuous in the $y$-variables and bounded on $[a, b] \times \mathbb{R}^d$.
This is the case for linear systems of differential equations, where

\[ f^i(x, y) = \sum_{j=1}^{d} a_{ij}(x)\, y^j + b_i(x), \quad i = 1, 2, \ldots, d, \tag{5.28} \]

and $a_{ij}(x)$, $b_i(x)$ are continuous functions on $[a, b]$. In nonlinear problems, it is
rarely the case, however, that a Lipschitz condition is valid uniformly in all of $\mathbb{R}^d$;
it more often holds in some compact and convex domain $D \subset \mathbb{R}^d$. In this case, one
can assert the existence of a unique solution only in some neighborhood of $x_0$ in
which $y(x)$ remains in $D$. To avoid this complication, we assume in this chapter
that $D$ is so large that $y(x)$ exists on the whole interval $[a, b]$ and that all numerical
approximations are also contained in $D$. Bounds on partial derivatives of $f(x, y)$
are assumed to hold uniformly in $[a, b] \times D$, if not in $[a, b] \times \mathbb{R}^d$, and it is tacitly
understood that the bounds may depend on $D$ but not on $x$ and $y$.


5.4 Numerical Methods 


One can distinguish between analytic approximation methods and discrete-variable
methods. In the former, one tries to find approximations $y_a(x) \approx y(x)$ to the
exact solution, valid for all $x \in [a, b]$. These usually take the form of a truncated
series expansion, either in powers of $x$, in Chebyshev polynomials, or in some
other system of basis functions. In discrete-variable methods, on the other hand,
one attempts to find approximations $u_n \in \mathbb{R}^d$ of $y(x_n)$ only at discrete points $x_n \in
[a, b]$. The abscissas $x_n$ may be predetermined (e.g., equally spaced on $[a, b]$) or,
more likely, are generated dynamically as part of the integration process. If desired,
one can from these discrete approximations $\{u_n\}$ again obtain an approximation
$y_a(x)$ defined for all $x \in [a, b]$ either by interpolation or, more naturally, by
a continuation mechanism built into the approximation method itself. We are
concerned here only with discrete-variable methods.

Depending on how the discrete approximations are generated, one distinguishes
between one-step methods and multistep methods. In the former, $u_{n+1}$ is determined
solely from a knowledge of $x_n$, $u_n$, and the step $h$ to proceed from $x_n$ to $x_{n+1} = x_n + h$.
In a $k$-step method ($k > 1$), knowledge of $k - 1$ additional points $(x_{n-\kappa}, u_{n-\kappa})$,
$\kappa = 1, 2, \ldots, k - 1$, is required to advance the solution. This chapter is devoted to
one-step methods; multistep methods are discussed in Chap. 6.

When describing a single step of a one-step method, it suffices to show how
one proceeds from a generic point $(x, y)$, $x \in [a, b]$, $y \in \mathbb{R}^d$, to the “next” point
$(x + h, y_{\mathrm{next}})$. We refer to this as the local description of the one-step method. This
also includes a discussion of the local accuracy, that is, how closely $y_{\mathrm{next}}$ agrees at
$x + h$ with the solution (passing through the point $(x, y)$) of the differential equation.
A one-step method for solving the initial value problem (5.16) effectively generates
a grid function $\{u_n\}$, $u_n \in \mathbb{R}^d$, on a grid $a = x_0 < x_1 < x_2 < \cdots < x_{N-1} <
x_N = b$ covering the interval $[a, b]$, whereby $u_n$ is intended to approximate the
exact solution $y(x)$ at $x = x_n$. The point $(x_{n+1}, u_{n+1})$ is obtained from the point
$(x_n, u_n)$ by applying a one-step method with an appropriate step $h_n = x_{n+1} - x_n$.
This is referred to as the global description of a one-step method. Questions of
interest here are the behavior of the global error $u_n - y(x_n)$, in particular stability
and convergence, and the choice of $h_n$ to proceed from one grid point $x_n$ to the next,
$x_{n+1} = x_n + h_n$. Finally, we address special difficulties arising from the stiffness of
the given differential equation problem.


5.5 Local Description of One-Step Methods 


Given a generic point $x \in [a, b]$, $y \in \mathbb{R}^d$, we define a single step of the one-step
method by

\[ y_{\mathrm{next}} = y + h \Phi(x, y; h), \quad h > 0. \tag{5.29} \]

The function $\Phi\colon [a, b] \times \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$ may be thought of as the approximate
increment per unit step, or the approximate difference quotient, and it defines the
method. Along with (5.29), we consider the solution $u(t)$ of the differential equation
(5.16) passing through the point $(x, y)$, that is, the local initial value problem

\[ \frac{du}{dt} = f(t, u), \quad x \le t \le x + h; \quad u(x) = y. \tag{5.30} \]

We call $u(t)$ the reference solution. The vector $y_{\mathrm{next}}$ in (5.29) is intended to
approximate $u(x + h)$. How successfully this is done is measured by the truncation
error defined as follows.


Definition 5.5.1. The truncation error of the method $\Phi$ at the point $(x, y)$ is
defined by

\[ T(x, y; h) = \frac{1}{h} [y_{\mathrm{next}} - u(x + h)]. \tag{5.31} \]

The truncation error thus is a vector-valued function of $d + 2$ variables. Using
(5.29) and (5.30), we can write for it, alternatively,

\[ T(x, y; h) = \Phi(x, y; h) - \frac{1}{h} [u(x + h) - u(x)], \tag{5.32} \]

showing that $T$ is the difference between the approximate and exact increment per
unit step.

An increasingly finer description of local accuracy is provided by the following
definitions, all based on the concept of truncation error.


Definition 5.5.2. The method $\Phi$ is called consistent if

\[ T(x, y; h) \to 0 \ \text{as} \ h \to 0, \tag{5.33} \]

uniformly for $(x, y) \in [a, b] \times \mathbb{R}^d$.³
By (5.32) and (5.30) we have consistency if and only if

\[ \Phi(x, y; 0) = f(x, y), \quad x \in [a, b], \quad y \in \mathbb{R}^d. \tag{5.34} \]


Definition 5.5.3. The method $\Phi$ is said to have order $p$ if, for some vector norm
$\|\cdot\|$,

\[ \|T(x, y; h)\| \le C h^p, \tag{5.35} \]

uniformly on $[a, b] \times \mathbb{R}^d$, with a constant $C$ not depending on $x$, $y$ and $h$.⁴


We express this property briefly as

\[ T(x, y; h) = O(h^p), \quad h \to 0. \tag{5.36} \]

Note that $p > 0$ implies consistency. Usually, $p$ is an integer $\ge 1$. It is called the
exact order, if (5.35) does not hold for any larger $p$.

Definition 5.5.4. A function $\tau\colon [a, b] \times \mathbb{R}^d \to \mathbb{R}^d$ that satisfies $\tau(x, y) \not\equiv 0$ and

\[ T(x, y; h) = \tau(x, y) h^p + O(h^{p+1}), \quad h \to 0, \tag{5.37} \]

is called the principal error function.


The principal error function determines the leading term in the truncation error.
The number $p$ in (5.37) is the exact order of the method since $\tau \not\equiv 0$.

All the preceding definitions are made with the idea in mind that $h > 0$ is a small
number. Then the larger $p$, the more accurate the method. This can (and should)
always be arranged by a proper scaling of the independent variable $x$; we tacitly
assume that such a scaling has already been made.


³More realistically, one should require uniformity on $[a, b] \times D$, where $D \subset \mathbb{R}^d$ is a sufficiently
large compact domain; cf. Sect. 5.3.

⁴If uniformity of (5.35) is required only on $[a, b] \times D$, $D \subset \mathbb{R}^d$ compact, then $C$ may depend on
$D$; cf. Sect. 5.3.
5.6 Examples of One-Step Methods


Some of the oldest methods are motivated by simple geometric considerations based 
on the slope field defined by the right-hand side of the differential equation. These 
include the Euler and modified Euler methods. More accurate and sophisticated 
methods are based on Taylor expansion. 


5.6.1 Euler’s Method 


Euler proposed his method in 1768, in the early days of calculus. It consists 
of simply following the slope at the generic point (x,y) over an interval of 
length h: 


\[ y_{\mathrm{next}} = y + h f(x, y). \tag{5.38} \]

Thus, $\Phi(x, y; h) = f(x, y)$ does not depend on $h$, and by (5.34) the method is
evidently consistent. For the truncation error, we have by (5.32)

\[ T(x, y; h) = f(x, y) - \frac{1}{h} [u(x + h) - u(x)], \tag{5.39} \]

where $u(t)$ is the reference solution defined in (5.30). Since $u'(x) = f(x, u(x)) =
f(x, y)$, we can write, using Taylor’s theorem,

\[ T(x, y; h) = u'(x) - \frac{1}{h} [u(x + h) - u(x)]
             = u'(x) - \frac{1}{h} \left[u(x) + h u'(x) + \tfrac12 h^2 u''(\xi) - u(x)\right]
             = -\tfrac12 h u''(\xi), \quad x < \xi < x + h, \tag{5.40} \]

assuming $u \in C^2[x, x + h]$. This is certainly true if $f \in C^1$ on $[a, b] \times \mathbb{R}^d$, as we
assume. Note the slight abuse of notation in the last two equations, where $\xi$ is to be
understood to differ from component to component but to be always in the interval
shown. We freely use this notation later on without further comment.

Now differentiating (5.30) totally with respect to $t$ and then setting $t = \xi$
yields

\[ T(x, y; h) = -\tfrac12 h [f_x + f_y f](\xi, u(\xi)), \tag{5.41} \]

where $f_x$ is the partial derivative of $f$ with respect to $x$ and $f_y$ the Jacobian of $f$
with respect to the $y$-variables. If, in the spirit of Theorem 5.3.1, we assume that $f$
and all its first partial derivatives are uniformly bounded in $[a, b] \times \mathbb{R}^d$, there exists
a constant $C$ independent of $x$, $y$, and $h$ such that

\[ \|T(x, y; h)\| \le C h. \tag{5.42} \]

Thus, Euler’s method has order $p = 1$. If we make the same assumption about
all second-order partial derivatives of $f$, we have $u''(\xi) = u''(x) + O(h)$ and,
therefore, from (5.40),

\[ T(x, y; h) = -\tfrac12 h [f_x + f_y f](x, y) + O(h^2), \quad h \to 0, \tag{5.43} \]

showing that the principal error function is given by

\[ \tau(x, y) = -\tfrac12 [f_x + f_y f](x, y). \tag{5.44} \]

Unless $f_x + f_y f \equiv 0$, the order of Euler’s method is exactly $p = 1$.
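The relation between the truncation error (5.31) and the principal error function (5.44) can be checked numerically. The following sketch uses the scalar problem $y' = -2xy$ and the point $(x, y) = (0.5, 1)$; both are illustrative assumptions, chosen because the reference solution is known in closed form.

```matlab
% Truncation error (5.31) of Euler's method for the illustrative problem
% y' = -2*x*y, whose solution through (x,y) is u(t) = y*exp(x^2 - t^2).
f = @(x,y) -2*x.*y;
x = 0.5; y = 1;
u = @(t) y*exp(x^2 - t.^2);               % reference solution, u(x) = y
tau = -0.5*(-2*y + (-2*x)*f(x,y));        % (5.44) with f_x = -2y, f_y = -2x
for h = [1e-1 1e-2 1e-3]
    ynext = y + h*f(x,y);                 % one Euler step (5.38)
    T = (ynext - u(x+h))/h;               % truncation error (5.31)
    fprintf('h = %.0e   T/h = %+.6f   tau = %+.6f\n', h, T/h, tau)
end
```

As $h$ decreases, the ratio $T/h$ should approach the value of $\tau(x, y)$, in line with (5.43).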


5.6.2 Method of Taylor Expansion 


We have seen that Euler’s method basically amounts to truncating the Taylor 
expansion of the reference solution after its second term. It is a natural idea, already 
proposed by Euler, to use more terms of the Taylor expansion. This requires the 
computation of successive “total derivatives” of f, 


\[ f^{[0]}(x, y) = f(x, y), \]
\[ f^{[k+1]}(x, y) = f_x^{[k]}(x, y) + f_y^{[k]}(x, y) f(x, y), \quad k = 0, 1, 2, \ldots, \tag{5.45} \]

which determine (see Ex. 2) the successive derivatives of the reference solution $u(t)$
of (5.30) by virtue of

\[ u^{(k+1)}(t) = f^{[k]}(t, u(t)), \quad k = 0, 1, 2, \ldots. \tag{5.46} \]

These, for $t = x$, become

\[ u^{(k+1)}(x) = f^{[k]}(x, y), \quad k = 0, 1, 2, \ldots, \tag{5.47} \]

and are used to form the Taylor series approximation according to

\[ y_{\mathrm{next}} = y + h \left[ f^{[0]}(x, y) + \frac12 h f^{[1]}(x, y) + \cdots + \frac{1}{p!} h^{p-1} f^{[p-1]}(x, y) \right], \tag{5.48} \]

that is,

\[ \Phi(x, y; h) = f^{[0]}(x, y) + \frac12 h f^{[1]}(x, y) + \cdots + \frac{1}{p!} h^{p-1} f^{[p-1]}(x, y). \tag{5.49} \]

For the truncation error, assuming $f \in C^p$ on $[a, b] \times \mathbb{R}^d$ and using (5.47) and
(5.49), we obtain from Taylor’s theorem

\[ T(x, y; h) = \Phi(x, y; h) - \frac{1}{h} [u(x + h) - u(x)]
             = \Phi(x, y; h) - \sum_{k=0}^{p-1} u^{(k+1)}(x) \frac{h^k}{(k+1)!} - u^{(p+1)}(\xi) \frac{h^p}{(p+1)!}
             = -u^{(p+1)}(\xi) \frac{h^p}{(p+1)!}, \quad x < \xi < x + h, \]

so that

\[ \|T(x, y; h)\| \le \frac{C_p}{(p+1)!}\, h^p, \tag{5.50} \]

where $C_p$ is a bound on the $p$th total derivative of $f$. Thus, the method has exact
order $p$ (unless $f^{[p]}(x, y) \equiv 0$), and the principal error function is

\[ \tau(x, y) = -\frac{1}{(p+1)!}\, f^{[p]}(x, y). \tag{5.51} \]


The necessity of computing many partial derivatives in (5.45) was a discouraging 
factor in the past, when this had to be done by hand. But nowadays, this labor 
can be delegated to computers, so that the method has become again a viable 
option. 
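For a concrete right-hand side the total derivatives in (5.45) can simply be worked out by hand. The following sketch of the Taylor method with $p = 2$ uses the problem $y' = -2xy$, $y(0) = 1$, an illustrative choice not taken from the text, and observes how the error at $x = 1$ behaves as $h$ is halved.

```matlab
% Second-order Taylor method (5.48) with p = 2 for the illustrative problem
% y' = -2*x*y, y(0) = 1 (exact solution exp(-x^2)); the total derivative
% f1 = f_x + f_y*f from (5.45) is worked out by hand for this f.
f0 = @(x,y) -2*x.*y;
f1 = @(x,y) -2*y + 4*x.^2.*y;
for N = [10 20 40]
    h = 1/N; x = 0; y = 1;
    for n = 1:N
        y = y + h*(f0(x,y) + 0.5*h*f1(x,y));
        x = x + h;
    end
    fprintf('h = %.4f   error at x = 1: %.3e\n', h, abs(y - exp(-1)))
end
```

The errors should shrink by a factor of about 4 per halving of $h$, as expected for a method of order 2.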


5.6.3 Improved Euler Methods 


There is too much inertia in Euler’s method: one should not follow the same initial 
slope over the whole interval of length h, since along this line segment the slope 
defined by the slope field of the differential equation changes. This suggests several 
alternatives. For example, we may wish to reevaluate the slope halfway through the 
line segment — retake the pulse of the differential equation, as it were — and then 
follow this revised slope over the whole interval (cf. Fig. 5.1). In formula, 


\[ y_{\mathrm{next}} = y + h f\left(x + \tfrac12 h,\ y + \tfrac12 h f(x, y)\right), \tag{5.52} \]

or

\[ \Phi(x, y; h) = f\left(x + \tfrac12 h,\ y + \tfrac12 h f(x, y)\right). \tag{5.53} \]

Fig. 5.1. Modified Euler method

Note the characteristic “nesting” of $f$ that is required here. For programming
purposes, it may be desirable to undo the nesting and write

\[ k_1(x, y) = f(x, y), \]
\[ k_2(x, y; h) = f\left(x + \tfrac12 h,\ y + \tfrac12 h k_1\right), \]
\[ y_{\mathrm{next}} = y + h k_2. \tag{5.54} \]

In other words, we are taking two trial slopes, $k_1$ and $k_2$, one at the initial point and
the other nearby, and then taking the latter as the final slope.

We could equally well take the second trial slope at $(x + h, y + h f(x, y))$, but
then, having waited too long before reevaluating the slope, take now as the final
slope the average of the two slopes:

\[ k_1(x, y) = f(x, y), \]
\[ k_2(x, y; h) = f(x + h, y + h k_1), \]
\[ y_{\mathrm{next}} = y + \tfrac12 h (k_1 + k_2). \tag{5.55} \]

This is sometimes referred to as Heun’s method or the trapezoidal rule.
The effect of both modifications is to raise the order by 1, as is shown in the next 
section. 
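Both variants are short enough to be written as one-line step functions. In the sketch below the names `modeuler` and `heun` and the test problem $y' = -2xy$, $y(0) = 1$ are illustrative assumptions; the loop only serves to observe the second-order behavior on one example.

```matlab
% One step of the modified Euler method (5.54) and of Heun's method (5.55)
% for a generic right-hand side f(x,y); the names modeuler and heun and the
% test problem y' = -2*x*y, y(0) = 1 are illustrative choices.
modeuler = @(f,x,y,h) y + h*f(x + h/2, y + (h/2)*f(x,y));
heun     = @(f,x,y,h) y + (h/2)*(f(x,y) + f(x + h, y + h*f(x,y)));
f = @(x,y) -2*x.*y;
for N = [10 20 40]
    h = 1/N; ym = 1; yh = 1;
    for n = 0:N-1
        ym = modeuler(f, n*h, ym, h);
        yh = heun(f, n*h, yh, h);
    end
    fprintf('h = %.4f   errors: %.3e (modified Euler)  %.3e (Heun)\n', ...
            h, abs(ym - exp(-1)), abs(yh - exp(-1)))
end
```

Each method costs two evaluations of $f$ per step, compared with one for Euler's method.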
5.6.4 Second-Order Two-Stage Methods


We may take a more systematic approach toward modifying Euler’s method, by 
letting 


®(x, y3h) = ak, + ako, (5.56) 
where 
ki(x,y) = f(x,y), 
ko(x, y;h) = f(x + wh, y + whk)). (3.57) 


We have now three parameters, a), @2, and j1, at our disposal, and we can try to 
choose them so as to maximize the order p. A systematic way of determining the 
maximum order p is to expand both ®(x, y;) and h7' [u(x +h) —u(x)] in powers 
of h and to match as many terms as we can. 

To expand ®, we need Taylor’s expansion for (vector-valued) functions of several 
variables, 


f(x + Ax, y+ Ay) =f t+ frAx+ fyAy + 5 Lfe(Ax? +2 fxyAxAy 
+ TAG) Fy(Ay)| Pe, (5.58) 


where f, denotes the Jacobian of f and fy, = [ ba y] the vector of Hessian 
matrices of f. In (5.58), all functions and partial derivatives are understood to be 
evaluated at (x, y). Letting Ax = wh, Ay = whf then gives 
1 
ko(x,ysh)= ft+puh(fe + Syf) + 3 oa a @ 2 °F 2hryf F t fehl) 
+ O(1). (5.59) 


Similarly (cf. (5.47)), 


7 [ux +h) —u(x)] = u'(x) + 5 hu!" (x) + iP ul"(x) + 0(h7), — (5.60) 


where 
u(x) =f, 
W@)= (Y= A+ hf. 
w(x) = f= fs fe 
= frxt fof t+ fof t (fey + fy Aw f 
= fox t+ 2fayf +f foyf + fy fet fof): 
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and where in the last equation we have used (see Ex. 3) 


(fy Dy f = Lo fyyf + ron 


Now, 


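$$T(x,y;h) = \alpha_1 k_1 + \alpha_2 k_2 - \frac{1}{h}\,[u(x+h)-u(x)],$$

wherein we substitute the expansions (5.59) and (5.60). We find

$$\begin{aligned}
T(x,y;h) ={}& (\alpha_1+\alpha_2-1)\,f + \left(\alpha_2\mu - \tfrac{1}{2}\right)h\,(f_x + f_y f)\\
 &+ \tfrac{1}{2}h^2\left[\left(\alpha_2\mu^2 - \tfrac{1}{3}\right)\left(f_{xx} + 2f_{xy}f + f^T f_{yy} f\right) - \tfrac{1}{3}f_y\,(f_x + f_y f)\right] + O(h^3).
\end{aligned} \tag{5.61}$$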


We can see now that, however we choose the parameters $\alpha_1$, $\alpha_2$, $\mu$, we cannot
make the coefficient of $h^2$ equal to zero unless severe restrictions are placed on $f$
(cf. Ex. 6(c)). Thus, the maximum possible order is $p = 2$, and we achieve it by
satisfying

$$\alpha_1 + \alpha_2 = 1, \qquad \alpha_2\mu = \tfrac{1}{2}. \tag{5.62}$$

This has a one-parameter family of solutions,

$$\alpha_1 = 1 - \alpha_2, \qquad \mu = \frac{1}{2\alpha_2} \qquad (\alpha_2 \ne 0 \text{ arbitrary}). \tag{5.63}$$

We recognize the improved Euler method contained therein with $\alpha_2 = 1$, and
Heun's method with $\alpha_2 = \tfrac{1}{2}$. There are other natural choices; one such would be to
look at the principal error function

$$\tau(x,y) = \frac{1}{2}\left[\left(\frac{1}{4\alpha_2} - \frac{1}{3}\right)\left(f_{xx} + 2f_{xy}f + f^T f_{yy} f\right) - \frac{1}{3}f_y\,(f_x + f_y f)\right] \tag{5.64}$$

and to note that it consists of a linear combination of two aggregates of partial
derivatives. We may wish to minimize some norm of the coefficients, say, the sum of
their absolute values. In (5.64), this gives trivially $(4\alpha_2)^{-1} - \tfrac{1}{3} = 0$, that is, $\alpha_2 = \tfrac{3}{4}$,
and hence suggests a method with

$$\alpha_1 = \tfrac{1}{4}, \qquad \alpha_2 = \tfrac{3}{4}, \qquad \mu = \tfrac{2}{3}. \tag{5.65}$$
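
To see the one-parameter family (5.63) at work, the following sketch implements
the two-stage method for an arbitrary nonzero value of the free parameter and
estimates the order empirically by halving the step. The test problem
y' = -2xy^2, y(0) = 1 (exact solution 1/(1 + x^2)), the function names, and the
step counts are assumptions made for illustration only.

```python
import math

def two_stage_step(f, x, y, h, alpha2):
    """One step of (5.56)-(5.57), with alpha1 and mu determined by (5.63)."""
    alpha1 = 1.0 - alpha2
    mu = 1.0 / (2.0 * alpha2)
    k1 = f(x, y)
    k2 = f(x + mu * h, y + mu * h * k1)
    return y + h * (alpha1 * k1 + alpha2 * k2)

def f_test(x, y):
    # hypothetical test problem y' = -2 x y^2, y(0) = 1, exact solution 1 / (1 + x^2)
    return -2.0 * x * y * y

def error_at_one(alpha2, n):
    """Absolute error at x = 1 after n steps of length 1/n."""
    h = 1.0 / n
    y = 1.0
    for i in range(n):
        y = two_stage_step(f_test, i * h, y, h, alpha2)
    return abs(y - 0.5)

def observed_order(alpha2, n=400):
    """log2 of the error ratio when the step is halved; about 2 for order p = 2."""
    return math.log2(error_at_one(alpha2, n) / error_at_one(alpha2, 2 * n))
```

Calling `observed_order(1.0)`, `observed_order(0.5)`, and `observed_order(0.75)`
corresponds to the improved Euler method, Heun's method, and the choice (5.65),
respectively; all three should return values close to 2.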
5.6.5 Runge-Kutta Methods


Runge-Kutta methods are a straightforward extension of two-stage methods to r- 
stage methods: 


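$$\begin{aligned}
\Phi(x,y;h) &= \sum_{s=1}^{r} \alpha_s k_s,\\
k_1(x,y) &= f(x,y),\\
k_s(x,y;h) &= f\Big(x+\mu_s h,\; y + h\sum_{j=1}^{s-1}\lambda_{sj}k_j\Big), \quad s = 2, 3, \ldots, r.
\end{aligned} \tag{5.66}$$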


It is natural in (5.66) to impose the conditions (cf. Ex. 10 for the first) 


$$\mu_s = \sum_{j=1}^{s-1}\lambda_{sj}, \qquad \sum_{s=1}^{r}\alpha_s = 1, \tag{5.67}$$


where the last one is nothing but the consistency condition (cf. (5.34)). We call 
(5.66) an explicit r-stage Runge-Kutta method; it requires r evaluations of the right- 
hand side f of the differential equation. More generally, we can consider implicit 
r-stage Runge-Kutta methods 


$$\begin{aligned}
\Phi(x,y;h) &= \sum_{s=1}^{r} \alpha_s k_s(x,y;h),\\
k_s &= f\Big(x+\mu_s h,\; y + h\sum_{j=1}^{r}\lambda_{sj}k_j\Big), \quad s = 1, 2, \ldots, r,
\end{aligned} \tag{5.68}$$

in which the last $r$ equations form a system of (in general nonlinear) equations in the
unknowns $k_1, k_2, \ldots, k_r$. Since each of these is a vector in $\mathbb{R}^d$, we have a system of
$rd$ equations in $rd$ unknowns that must be solved before we can form the approximate
increment $\Phi$. Less work is required in semi-implicit Runge-Kutta methods,
where the summation in the formula for $k_s$ extends from $j = 1$ to $j = s$ only. This
yields $r$ systems of equations, each having only $d$ unknowns, the components of $k_s$.
The considerable computational expense involved in implicit and semi-implicit
methods can only be justified in special circumstances, for example, stiff problems.
The reason is that implicit methods not only can be made to have higher order than
explicit methods, but also have better stability properties (cf. Sect. 5.9).

Already in the case of explicit Runge-Kutta methods, and even more so in 
implicit methods, we have at our disposal a large number of parameters which we 
can choose to achieve the maximum possible order for all sufficiently smooth f. 
The approach is analogous to the one taken in Sect. 5.6.4, only technically much
more involved, as the partial-derivative aggregates that are going to appear in the 
principal error function are becoming much more complicated and numerous as r 
increases. In fact, a satisfactory solution of this problem has become possible only 
through the employment of graph-theoretical tools, specifically, the theory of rooted 
trees. This was systematically developed by J.C. Butcher, the principal researcher 
in this area. A description of these techniques is beyond the scope of this book. We 
briefly summarize, however, some of the results that can be obtained. 

Denote by $p^*(r)$ the maximum attainable order (for arbitrary sufficiently smooth
$f$) of an explicit $r$-stage Runge-Kutta method. Then Kutta⁵ has already shown in
1901 that

$$p^*(r) = r \quad \text{for } 1 \le r \le 4, \tag{5.69}$$

and has derived many concrete examples of methods having these orders. Butcher,
in the 1960s, established that

$$\begin{aligned}
p^*(r) &= r-1 \quad \text{for } 5 \le r \le 7,\\
p^*(r) &= r-2 \quad \text{for } 8 \le r \le 9,\\
p^*(r) &\le r-2 \quad \text{for } r \ge 10.
\end{aligned} \tag{5.70}$$

Specific examples of higher-order Runge-Kutta formulae are used later in connection
with error monitoring procedures. Here, we mention only the classical
Runge-Kutta formula⁶ of order $p = 4$:


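$$\begin{aligned}
\Phi(x,y;h) &= \tfrac{1}{6}\,(k_1 + 2k_2 + 2k_3 + k_4),\\
k_1(x,y) &= f(x,y),\\
k_2(x,y;h) &= f\left(x+\tfrac{1}{2}h,\; y+\tfrac{1}{2}hk_1\right),\\
k_3(x,y;h) &= f\left(x+\tfrac{1}{2}h,\; y+\tfrac{1}{2}hk_2\right),\\
k_4(x,y;h) &= f(x+h,\; y+hk_3).
\end{aligned} \tag{5.71}$$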


5 Wilhelm Martin Kutta (1867-1944) was a German applied mathematician, active at the Technical
University of Stuttgart from 1911 until his retirement. In addition to his work on the numerical
solution of ODEs, he did important work on the application of conformal mapping to hydro- and
aerodynamical problems. Best known is his formula for the lift exerted on an airfoil, now known
as the Kutta-Joukowski formula. For Runge, see footnote 7 in Chap. 2, Sect. 2.2.3.


6 Runge's idea, in 1895, was to generalize Simpson's quadrature formula (cf. Chap. 3, Sect. 3.2.1)
to ordinary differential equations. He succeeded only partially in that the generalization he gave
had stage number r = 4 but only order p = 3. The method (5.71) of order p = 4 was discovered
in 1901 by Kutta through a systematic search.

When f does not depend on y, and thus we are in the case of a numerical quadrature
problem (cf. (5.2)), then (5.71) reduces to Simpson’s formula. 
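
A short sketch makes the last remark concrete. It codes the increment function
of (5.71) and compares one step, for a right-hand side that does not depend on
y, with Simpson's formula on the same interval. The function names and the
sample integrand cos x are assumptions made for illustration.

```python
import math

def rk4_increment(f, x, y, h):
    """Increment function Phi(x, y; h) of the classical Runge-Kutta formula (5.71)."""
    k1 = f(x, y)
    k2 = f(x + 0.5 * h, y + 0.5 * h * k1)
    k3 = f(x + 0.5 * h, y + 0.5 * h * k2)
    k4 = f(x + h, y + h * k3)
    return (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def simpson(g, a, b):
    """Simpson's formula for the integral of g over [a, b]."""
    return (b - a) / 6.0 * (g(a) + 4.0 * g(0.5 * (a + b)) + g(b))

# f independent of y: one Runge-Kutta step is a quadrature over [x0, x0 + h]
x0, h = 0.3, 0.2
rk4_value = h * rk4_increment(lambda x, y: math.cos(x), x0, 0.0, h)
simpson_value = simpson(math.cos, x0, x0 + h)
```

Here `rk4_value` and `simpson_value` agree up to rounding, since the two middle
stages coincide when f does not depend on y.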


5.7 Global Description of One-Step Methods


We now turn to the numerical solution of the initial value problem (5.16) with the 
help of one-step formulae such as those developed in Sect. 5.6. The description of 
such one-step methods is best done in terms of grids and grid functions. A grid on
the interval $[a,b]$ is a set of points $\{x_n\}_{n=0}^{N}$ such that

$$a = x_0 < x_1 < x_2 < \cdots < x_{N-1} < x_N = b, \tag{5.72}$$

with grid lengths $h_n$ defined by

$$h_n = x_{n+1} - x_n, \quad n = 0, 1, \ldots, N-1. \tag{5.73}$$

The fineness of the grid is measured by

$$|h| = \max_{0 \le n \le N-1} h_n. \tag{5.74}$$

We often use the letter $h$ to designate the collection of lengths $h = \{h_n\}$. If
$h_0 = h_1 = \cdots = h_{N-1} = (b-a)/N$, we call (5.72) a uniform grid, otherwise
a nonuniform grid. For uniform grids, we use the letter $h$ also to designate the
common grid length $h = (b-a)/N$. A vector-valued function $v = \{v_n\}$, $v_n \in \mathbb{R}^d$,
defined on the grid (5.72) is called a grid function. Thus, $v_n$ is the value of $v$ at
the gridpoint $x_n$. Every function $v(x)$ defined on $[a,b]$ induces a grid function by
restriction. We denote the set of grid functions on $[a,b]$ by $\Gamma_h[a,b]$, and for each
grid function $v = \{v_n\}$ define its norm by

$$\|v\|_\infty = \max_{0 \le n \le N} \|v_n\|, \quad v \in \Gamma_h[a,b]. \tag{5.75}$$


A one-step method (indeed, any discrete-variable method) is a method
producing a grid function $u = \{u_n\}$ such that $u \approx y$, where $y = \{y_n\}$ is the
grid function induced by the exact solution $y(x)$ of the initial value problem (5.16).
The grid (5.72) may be predetermined, for example, a uniform grid, or, as is more
often the case in practice, be produced dynamically as part of the method (cf.
Sect. 5.8.3). The most general scheme involving one-step formulae is a variable-method
variable-step method. Given a sequence $\{\Phi_n\}$ of one-step formulae, the
method proceeds as follows,

$$\begin{aligned}
x_{n+1} &= x_n + h_n,\\
u_{n+1} &= u_n + h_n\Phi_n(x_n, u_n; h_n),
\end{aligned} \qquad n = 0, 1, \ldots, N-1, \tag{5.76}$$

where $x_0 = a$, $u_0 = y_0$. For simplicity, we only consider single-method schemes
involving a single method $\Phi$, although the extension to variable-method schemes
would not cause any essential difficulties (just more writing).
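
The scheme (5.76), in its single-method form, translates almost literally into
code. The sketch below is a minimal illustration for scalar problems; the names
`one_step_solve`, `euler_phi`, and `fineness`, as well as the sample grid, are
our own assumptions.

```python
def euler_phi(f):
    """Increment function of Euler's method: Phi(x, y; h) = f(x, y)."""
    return lambda x, y, h: f(x, y)

def one_step_solve(phi, grid, y0):
    """Generate the grid function u = {u_n} of (5.76) for a single method phi."""
    u = [y0]
    for n in range(len(grid) - 1):
        h_n = grid[n + 1] - grid[n]                    # grid length (5.73)
        u.append(u[n] + h_n * phi(grid[n], u[n], h_n))
    return u

def fineness(grid):
    """The quantity |h| of (5.74): the largest grid length."""
    return max(grid[n + 1] - grid[n] for n in range(len(grid) - 1))

grid = [0.0, 0.1, 0.25, 0.5, 0.6, 1.0]                 # a nonuniform grid on [0, 1]
u = one_step_solve(euler_phi(lambda x, y: y), grid, 1.0)
```

For Euler's method applied to y' = y, each step multiplies the current value by
1 + h_n, so the final entry of `u` is the product of these factors over the grid.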

To bring out the analogy between (5.16) and (5.76), we introduce operators $R$ and
$R_h$ acting on $C^1[a,b]$ and $\Gamma_h[a,b]$, respectively. These are the residual operators

$$(Rv)(x) := v'(x) - f(x, v(x)), \quad v \in C^1[a,b], \tag{5.77}$$

$$(R_hv)_n := \frac{1}{h_n}\,(v_{n+1} - v_n) - \Phi(x_n, v_n; h_n), \quad n = 0, 1, \ldots, N-1;
\quad v = \{v_n\} \in \Gamma_h[a,b]. \tag{5.78}$$

(The grid function $\{(R_hv)_n\}$ is not defined for $n = N$, but we may arbitrarily set
$(R_hv)_N = (R_hv)_{N-1}$.) Then the problem (5.16) and its discrete analogue (5.76) can
be written transparently as

$$Ry = 0 \ \text{ on } [a,b], \quad y(a) = y_0, \tag{5.79}$$

$$R_hu = 0 \ \text{ on } [a,b], \quad u_0 = y_0. \tag{5.80}$$

Note that the discrete residual operator (5.78) is closely related to the truncation
error (5.32) when we apply the operator at a point $(x_n, y(x_n))$ on the exact solution
trajectory. Then indeed the reference solution $u(t)$ coincides with the solution $y(t)$,
and

$$(R_hy)_n = \frac{1}{h_n}\,[y(x_{n+1}) - y(x_n)] - \Phi(x_n, y(x_n); h_n) = -T(x_n, y(x_n); h_n). \tag{5.81}$$


5.7.1 Stability 


Stability is a property of the numerical scheme (5.76) alone and a priori has 
nothing to do with its approximation power. It characterizes the robustness of the 
scheme with respect to small perturbations. Nevertheless, stability combined with 
consistency yields convergence of the numerical solution to the true solution. 

We define stability in terms of the discrete residual operators $R_h$ in (5.78). As
usual, we assume $\Phi(x,y;h)$ to be defined on $[a,b] \times \mathbb{R}^d \times [0,h_0]$, where $h_0 > 0$ is
some suitable positive number.


Definition 5.7.1. The method (5.76) is called stable on $[a,b]$ if there exists a
constant $K > 0$ not depending on $h$ such that for an arbitrary grid $h$ on $[a,b]$,
and for arbitrary two grid functions $v, w \in \Gamma_h[a,b]$, there holds

$$\|v - w\|_\infty \le K\left(\|v_0 - w_0\| + \|R_hv - R_hw\|_\infty\right), \quad v, w \in \Gamma_h[a,b], \tag{5.82}$$

for all $h$ with $|h|$ sufficiently small. In (5.82), the infinity norm for grid functions is
the norm defined in (5.75).


We refer to (5.82) as the stability inequality. The motivation for it is as follows.
Suppose we have two grid functions $u$, $w$ satisfying

$$R_hu = 0, \quad u_0 = y_0, \tag{5.83}$$

$$R_hw = \varepsilon, \quad w_0 = y_0 + \eta_0, \tag{5.84}$$

where $\varepsilon = \{\varepsilon_n\} \in \Gamma_h[a,b]$ is a grid function with small $\|\varepsilon_n\|$, and $\|\eta_0\|$ is also
small. We may interpret $u \in \Gamma_h[a,b]$ as the result of applying the numerical scheme
(5.76) in infinite precision, whereas $w \in \Gamma_h[a,b]$ could be the solution of (5.76) in
floating-point arithmetic. Then if stability holds, we have

$$\|u - w\|_\infty \le K\left(\|\eta_0\| + \|\varepsilon\|_\infty\right); \tag{5.85}$$

that is, the global change in $u$ is of the same order of magnitude as the local
residual errors $\{\varepsilon_n\}$ and initial error $\eta_0$. It should be appreciated, however, that the
first equation in (5.84) says $w_{n+1} - w_n - h_n\Phi(x_n, w_n; h_n) = h_n\varepsilon_n$, meaning that
“rounding errors” must go to zero as $|h| \to 0$.

Interestingly enough, a Lipschitz condition on $\Phi$ is all that is required for
stability.


Theorem 5.7.1. If $\Phi(x,y;h)$ satisfies a Lipschitz condition with respect to the
$y$-variables,

$$\|\Phi(x,y;h) - \Phi(x,y^*;h)\| \le M\|y - y^*\| \quad \text{on } [a,b] \times \mathbb{R}^d \times [0,h_0], \tag{5.86}$$

then the method (5.76) is stable.


We precede the proof with the following useful lemma. 


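Lemma 5.7.1. Let $\{e_n\}$ be a sequence of numbers $e_n \in \mathbb{R}$ satisfying

$$e_{n+1} \le a_ne_n + b_n, \quad n = 0, 1, \ldots, N-1, \tag{5.87}$$

where $a_n > 0$ and $b_n \in \mathbb{R}$. Then

$$e_n \le E_n, \quad E_n = \Big(\prod_{k=0}^{n-1}a_k\Big)e_0 + \sum_{k=0}^{n-1}\Big(\prod_{\ell=k+1}^{n-1}a_\ell\Big)b_k, \quad n = 0, 1, \ldots, N. \tag{5.88}$$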


We adopt here the usual convention that an empty product has the value 1, and 
an empty sum the value 0. 


Proof of Lemma 5.7.1. It is readily verified that

$$E_{n+1} = a_nE_n + b_n, \quad n = 0, 1, \ldots, N-1; \quad E_0 = e_0.$$

Subtracting this from the inequality in (5.87), we get

$$e_{n+1} - E_{n+1} \le a_n(e_n - E_n), \quad n = 0, 1, \ldots, N-1.$$

Now, $e_0 - E_0 = 0$, so that $e_1 - E_1 \le 0$. Therefore, $e_2 - E_2 \le a_1(e_1 - E_1)$, implying
$e_2 - E_2 \le 0$ since $a_1 > 0$. In the same way, by induction, $e_n - E_n \le 0$. □


Proof of Theorem 5.7.1. Let $h = \{h_n\}$ be an arbitrary grid on $[a,b]$, and $v, w \in
\Gamma_h[a,b]$ two arbitrary (vector-valued) grid functions. By definition of $R_h$, we can
write

$$v_{n+1} = v_n + h_n\Phi(x_n, v_n; h_n) + h_n(R_hv)_n, \quad n = 0, 1, \ldots, N-1,$$

and similarly for $w_{n+1}$. Subtraction then gives

$$\begin{aligned}
v_{n+1} - w_{n+1} ={}& v_n - w_n + h_n[\Phi(x_n, v_n; h_n) - \Phi(x_n, w_n; h_n)]\\
 &+ h_n[(R_hv)_n - (R_hw)_n], \quad n = 0, 1, \ldots, N-1.
\end{aligned} \tag{5.89}$$

Now define

$$e_n = \|v_n - w_n\|, \quad d_n = \|(R_hv)_n - (R_hw)_n\|, \quad \delta = \max_n d_n. \tag{5.90}$$

Then using the triangle inequality in (5.89) and the Lipschitz condition (5.86) for
$\Phi$, we obtain

$$e_{n+1} \le (1 + h_nM)e_n + h_n\delta, \quad n = 0, 1, \ldots, N-1. \tag{5.91}$$

This is inequality (5.87) with $a_n = 1 + h_nM$, $b_n = h_n\delta$. Since for $k = 0, 1, \ldots, n-1$
and $n \le N$, we have

$$\prod_{\ell=k+1}^{n-1}a_\ell \le \prod_{\ell=0}^{N-1}a_\ell = \prod_{\ell=0}^{N-1}(1 + h_\ell M) \le \prod_{\ell=0}^{N-1}e^{h_\ell M}
 = e^{(h_0 + h_1 + \cdots + h_{N-1})M} = e^{(b-a)M},$$

where $1 + x \le e^x$ has been used in the second inequality, we obtain from
Lemma 5.7.1 that

$$\begin{aligned}
e_n &\le e^{(b-a)M}e_0 + e^{(b-a)M}\sum_{k=0}^{n-1}h_k\delta\\
 &\le e^{(b-a)M}\left(e_0 + (b-a)\delta\right), \quad n = 0, 1, \ldots, N-1.
\end{aligned}$$

Taking the maximum over $n$ and recalling the definition of $e_n$ and $\delta$, we get

$$\|v - w\|_\infty \le e^{(b-a)M}\left(\|v_0 - w_0\| + (b-a)\|R_hv - R_hw\|_\infty\right),$$

which is (5.82) with $K = e^{(b-a)M}\max\{1, b-a\}$. □
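
The stability inequality can be checked numerically in the setting of (5.83) and
(5.84). The sketch below uses Euler's method, for which M may be taken equal to
the Lipschitz constant of f; the right-hand side sin x - y, the perturbation
sizes, and all names are assumptions chosen for illustration.

```python
import math

def euler_solve(f, grid, y0, eps=None):
    """Euler's method; if eps is given, solve R_h w = eps instead of R_h u = 0."""
    w = [y0]
    for n in range(len(grid) - 1):
        h_n = grid[n + 1] - grid[n]
        e_n = 0.0 if eps is None else eps[n]
        w.append(w[n] + h_n * (f(grid[n], w[n]) + e_n))
    return w

def f(x, y):
    # hypothetical right-hand side with Lipschitz constant 1 in y
    return math.sin(x) - y

a, b, N = 0.0, 2.0, 200
grid = [a + (b - a) * n / N for n in range(N + 1)]
eta0 = 1e-6                                          # perturbation of the initial value
eps = [1e-6 * math.cos(7.0 * x) for x in grid[:-1]]  # residuals eps_n of (5.84)

u = euler_solve(f, grid, 1.0)                        # satisfies (5.83)
w = euler_solve(f, grid, 1.0 + eta0, eps)            # satisfies (5.84)

M = 1.0                                              # Lipschitz constant of Phi = f
K = math.exp((b - a) * M) * max(1.0, b - a)          # constant from the proof
lhs = max(abs(un - wn) for un, wn in zip(u, w))      # ||u - w||_inf
rhs = K * (abs(eta0) + max(abs(e) for e in eps))     # right-hand side of (5.85)
```

In this example `lhs` stays well below `rhs`; the constant K from the proof is a
worst-case bound and is typically far from sharp.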
We have actually proved stability for all $|h| \le h_0$, not only for $|h|$ sufficiently
small. The proof is virtually the same for a variable-method algorithm involving a
family of one-step formulae $\{\Phi_n\}$, if we assume a Lipschitz condition for each $\Phi_n$
with a constant $M$ independent of $n$.

All one-step methods used in practice satisfy a Lipschitz condition if $f$ does,
and the constant $M$ for $\Phi$ can be expressed in terms of the Lipschitz constant $L$
for $f$. This is obvious for Euler's method, and not difficult to prove for others (see
Ex. 13). It is useful to note that $\Phi$ need not be continuous in $x$; piecewise continuity
suffices, as long as (5.86) holds for all $x \in [a,b]$, taking one-sided limits at points
of discontinuity.

For later use, we state another application of Lemma 5.7.1, relative to a grid
function $v \in \Gamma_h[a,b]$ satisfying

$$v_{n+1} = v_n + h_n(A_nv_n + b_n), \quad n = 0, 1, \ldots, N-1, \tag{5.92}$$

where $A_n \in \mathbb{R}^{d \times d}$, $b_n \in \mathbb{R}^d$, and $h = \{h_n\}$ is an arbitrary grid on $[a,b]$.


Lemma 5.7.2. Suppose in (5.92) that

$$\|A_n\| \le M, \quad \|b_n\| \le \delta, \quad n = 0, 1, \ldots, N-1, \tag{5.93}$$

where the constants $M$, $\delta$ do not depend on $h$. Then there exists a constant $K > 0$
independent of $h$, but depending on $\|v_0\|$, such that

$$\|v\|_\infty \le K. \tag{5.94}$$

Proof. The lemma follows at once by observing that

$$\|v_{n+1}\| \le (1 + h_nM)\|v_n\| + h_n\delta, \quad n = 0, 1, \ldots, N-1,$$

which is precisely the inequality (5.91) in the proof of Theorem 5.7.1, hence

$$\|v_n\| \le e^{(b-a)M}\left\{\|v_0\| + (b-a)\delta\right\}. \tag{5.95}$$
□


5.7.2 Convergence 


Stability is a rather powerful concept; it implies almost immediately convergence, 
and is also instrumental in deriving asymptotic global error estimates. We begin by 
defining precisely what we mean by convergence. 


Definition 5.7.2. Let $a = x_0 < x_1 < x_2 < \cdots < x_N = b$ be a grid on $[a,b]$ with
grid length $|h| = \max_{1 \le n \le N}(x_n - x_{n-1})$. Let $u = \{u_n\}$ be the grid function defined by
applying the method (5.76) (with $\Phi_n = \Phi$) on $[a,b]$, and $y = \{y_n\}$ the grid function
induced by the exact solution of the initial value problem (5.16). The method (5.76)
is said to converge on $[a,b]$ if there holds

$$\|u - y\|_\infty \to 0 \quad \text{as } |h| \to 0. \tag{5.96}$$


Theorem 5.7.2. If the method (5.76) is consistent and stable on $[a,b]$, then it
converges. Moreover, if $\Phi$ has order $p$, then

$$\|u - y\|_\infty = O(|h|^p) \quad \text{as } |h| \to 0. \tag{5.97}$$


Proof. By the stability inequality (5.82) applied to the grid functions $v = u$ and
$w = y$ of Definition 5.7.2, we have, for $|h|$ sufficiently small,

$$\|u - y\|_\infty \le K\left(\|u_0 - y(x_0)\| + \|R_hu - R_hy\|_\infty\right) = K\|R_hy\|_\infty, \tag{5.98}$$

since $u_0 = y(x_0)$ and $R_hu = 0$ by (5.76). But, by (5.81),

$$\|R_hy\|_\infty = \|T(\cdot, y; h)\|_\infty, \tag{5.99}$$

where $T$ is the truncation error of the method $\Phi$. By definition of consistency,

$$\|T(\cdot, y; h)\|_\infty \to 0 \quad \text{as } |h| \to 0,$$

which proves the first part of the theorem. The second part follows immediately
from (5.98) and (5.99), since order $p$ means, by definition, that

$$\|T(\cdot, y; h)\|_\infty = O(|h|^p) \quad \text{as } |h| \to 0. \tag{5.100}$$
□

Since, as we already observed, practically all one-step methods are stable and of
order $p \ge 1$ (under reasonable smoothness assumptions on $f$), it follows that they
are all convergent as well.
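
The rate predicted by (5.97) is easy to observe experimentally. The sketch
below measures the maximum grid error of Euler's method and of the classical
Runge-Kutta method on uniform grids and estimates the order from the error
ratio under halving of h. The test problem y' = -y, y(0) = 1 on [0, 1], the
function names, and the grid sizes are assumptions made for illustration.

```python
import math

def euler_phi(f, x, y, h):
    """Increment function of Euler's method (order 1)."""
    return f(x, y)

def rk4_phi(f, x, y, h):
    """Increment function of the classical Runge-Kutta method (order 4)."""
    k1 = f(x, y)
    k2 = f(x + 0.5 * h, y + 0.5 * h * k1)
    k3 = f(x + 0.5 * h, y + 0.5 * h * k2)
    k4 = f(x + h, y + h * k3)
    return (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def max_error(phi, n):
    """||u - y||_inf on a uniform grid with n steps for y' = -y, y(0) = 1 on [0, 1]."""
    f = lambda x, y: -y
    h = 1.0 / n
    u, err = 1.0, 0.0
    for k in range(n):
        u = u + h * phi(f, k * h, u, h)
        err = max(err, abs(u - math.exp(-(k + 1) * h)))
    return err

def observed_order(phi, n=20):
    """log2 of the error ratio when h is halved; should be close to p."""
    return math.log2(max_error(phi, n) / max_error(phi, 2 * n))
```

With these definitions, `observed_order(euler_phi)` should come out near 1 and
`observed_order(rk4_phi)` near 4.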


5.7.3 Asymptotics of Global Error 


Just as the principal error function describes the leading contribution to the local
truncation error, it is of interest to identify the leading term in the global error $u_n -
y(x_n)$. To simplify matters, we assume a constant grid length $h$, although it would
not be difficult to deal with variable grid lengths of the form $h_n = \vartheta(x_n)h$, where
$\vartheta(x)$ is piecewise continuous and $0 < \vartheta(x) \le \Theta$ for $a \le x \le b$. Thus, we consider
our one-step method to have the form

$$\begin{aligned}
x_{n+1} &= x_n + h,\\
u_{n+1} &= u_n + h\Phi(x_n, u_n; h), \quad n = 0, 1, \ldots, N-1,\\
x_0 &= a, \quad u_0 = y_0,
\end{aligned} \tag{5.101}$$


defining a grid function $u = \{u_n\}$ on a uniform grid over $[a,b]$. We are interested in
the asymptotic behavior of $u_n - y(x_n)$ as $h \to 0$, where $y(x)$ is the exact solution
of the initial value problem

$$\frac{dy}{dx} = f(x,y), \quad a \le x \le b; \qquad y(a) = y_0. \tag{5.102}$$

Theorem 5.7.3. Assume that

(1) $\Phi(x,y;h) \in C^2$ on $[a,b] \times \mathbb{R}^d \times [0,h_0]$;
(2) $\Phi$ is a method of order $p \ge 1$ admitting a principal error function $\tau(x,y) \in C$
on $[a,b] \times \mathbb{R}^d$;
(3) $e(x)$ is the solution of the linear initial value problem

$$\frac{de}{dx} = f_y(x, y(x))\,e + \tau(x, y(x)), \quad a \le x \le b; \qquad e(a) = 0. \tag{5.103}$$

Then, for $n = 0, 1, \ldots, N$,

$$u_n - y(x_n) = e(x_n)h^p + O(h^{p+1}) \quad \text{as } h \to 0. \tag{5.104}$$


Before we prove the theorem, we make the following remarks. 


1. The precise meaning of (5.104) is

$$\|u - y - h^pe\|_\infty = O(h^{p+1}) \quad \text{as } h \to 0, \tag{5.105}$$

where $u$, $y$, $e$ are the grid functions $u = \{u_n\}$, $y = \{y(x_n)\}$, $e = \{e(x_n)\}$ and
$\|\cdot\|_\infty$ is the norm defined in (5.75).

2. Since by consistency $\Phi(x,y;0) = f(x,y)$, assumption (1) implies $f \in C^2$ on
$[a,b] \times \mathbb{R}^d$, which is more than enough to guarantee the existence and uniqueness
of the solution $e(x)$ of (5.103) on the whole interval $[a,b]$.

3. The fact that some, but not all, components of $\tau(x,y)$ may vanish identically
does not imply that the corresponding components of $e(x)$ also vanish, since
(5.103) is a coupled system of differential equations.


Proof of Theorem 5.7.3. We begin with an auxiliary computation, an estimate for

$$\Phi(x_n, u_n; h) - \Phi(x_n, y(x_n); h). \tag{5.106}$$

By Taylor's theorem (for functions of several variables), applied to the $i$th component
of (5.106), we have

$$\begin{aligned}
\Phi^i(x_n, u_n; h) - \Phi^i(x_n, y(x_n); h) ={}& \sum_{j=1}^{d}\Phi^i_{y^j}(x_n, y(x_n); h)\,[u_n^j - y^j(x_n)]\\
 &+ \frac{1}{2}\sum_{j,k=1}^{d}\Phi^i_{y^jy^k}(x_n, \bar{u}_n; h)\,[u_n^j - y^j(x_n)]\,[u_n^k - y^k(x_n)],
\end{aligned} \tag{5.107}$$

where $\bar{u}_n$ is on the line segment connecting $u_n$ and $y(x_n)$. Using Taylor's theorem
once more, in the variable $h$, we can write

$$\Phi^i_{y^j}(x_n, y(x_n); h) = \Phi^i_{y^j}(x_n, y(x_n); 0) + h\,\Phi^i_{y^jh}(x_n, y(x_n); \bar{h}),$$

where $0 < \bar{h} < h$. Since by consistency $\Phi(x,y;0) = f(x,y)$ on $[a,b] \times \mathbb{R}^d$, we
have

$$\Phi^i_{y^j}(x,y;0) = f^i_{y^j}(x,y), \quad x \in [a,b], \ y \in \mathbb{R}^d,$$

and assumption (1) allows us to write

$$\Phi^i_{y^j}(x_n, y(x_n); h) = f^i_{y^j}(x_n, y(x_n)) + O(h), \quad h \to 0. \tag{5.108}$$

Now observing that $u_n - y(x_n) = O(h^p)$ by virtue of Theorem 5.7.2, and using
(5.108) in (5.107), we get, again by assumption (1),

$$\Phi^i(x_n, u_n; h) - \Phi^i(x_n, y(x_n); h)
 = \sum_{j=1}^{d}f^i_{y^j}(x_n, y(x_n))\,[u_n^j - y^j(x_n)] + O(h^{p+1}) + O(h^{2p}).$$

But $O(h^{2p})$ is also of order $O(h^{p+1})$, since $p \ge 1$. Thus, in vector notation,

$$\Phi(x_n, u_n; h) - \Phi(x_n, y(x_n); h) = f_y(x_n, y(x_n))\,[u_n - y(x_n)] + O(h^{p+1}). \tag{5.109}$$

Now, to highlight the leading term in the global error, we define the grid function
$r = \{r_n\}$ by

$$r = h^{-p}(u - y). \tag{5.110}$$
Then

$$\begin{aligned}
\frac{1}{h}\,(r_{n+1} - r_n) &= \frac{1}{h}\left[h^{-p}(u_{n+1} - y(x_{n+1})) - h^{-p}(u_n - y(x_n))\right]\\
 &= h^{-p}\left[\frac{1}{h}\,(u_{n+1} - u_n) - \frac{1}{h}\,(y(x_{n+1}) - y(x_n))\right]\\
 &= h^{-p}\left[\Phi(x_n, u_n; h) - \{\Phi(x_n, y(x_n); h) - T(x_n, y(x_n); h)\}\right],
\end{aligned}$$

where we have used (5.101) and the relation (5.81) for the truncation error $T$.
Therefore, expressing $T$ in terms of the principal error function $\tau$, we get

$$\frac{1}{h}\,(r_{n+1} - r_n) = h^{-p}\left[\Phi(x_n, u_n; h) - \Phi(x_n, y(x_n); h)
 + \tau(x_n, y(x_n))h^p + O(h^{p+1})\right].$$

For the first two terms in brackets, we use (5.109) and the definition of $r$ in (5.110)
to obtain

$$\begin{aligned}
\frac{1}{h}\,(r_{n+1} - r_n) &= f_y(x_n, y(x_n))\,r_n + \tau(x_n, y(x_n)) + O(h), \quad n = 0, 1, \ldots, N-1,\\
r_0 &= 0.
\end{aligned} \tag{5.111}$$

Now letting

$$g(x,y) := f_y(x, y(x))\,y + \tau(x, y(x)), \tag{5.112}$$

we can interpret (5.111) by writing

$$\left(R_h^{\mathrm{Euler},g}\,r\right)_n = \varepsilon_n \quad (n = 0, 1, \ldots, N-1), \qquad \varepsilon_n = O(h),$$

where $R_h^{\mathrm{Euler},g}$ is the discrete residual operator (5.78) that goes with Euler's method
applied to $e' = g(x,e)$, $e(a) = 0$. Since Euler's method is stable on $[a,b]$ and $g$
(being linear in $y$) certainly satisfies a uniform Lipschitz condition, we have by the
stability inequality (5.82)

$$\|r - e\|_\infty = O(h),$$

and hence, by (5.110),

$$\|u - y - h^pe\|_\infty = O(h^{p+1}),$$

as was to be shown. □
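
A simple case in which (5.104) can be verified by hand and by machine is
Euler's method (p = 1) applied to y' = -y, y(0) = 1. We use here, as an
assumption carried over from the local analysis, that the principal error
function of Euler's method is -(f_x + f_y f)/2, which on the exact solution
equals -exp(-x)/2; the variational problem (5.103) then reads
e' = -e - exp(-x)/2, e(0) = 0, with solution e(x) = -x exp(-x)/2. The sketch
below (all names hypothetical) compares the actual global error with h e(x_n).

```python
import math

def euler_global_error(h, b=1.0):
    """Return x_n, u_n - y(x_n), and e(x_n) for Euler's method on y' = -y, y(0) = 1."""
    n_steps = round(b / h)
    xs, errs, es = [], [], []
    u = 1.0
    for n in range(n_steps + 1):
        x = n * h
        xs.append(x)
        errs.append(u - math.exp(-x))
        es.append(-0.5 * x * math.exp(-x))   # solution of e' = -e - exp(-x)/2, e(0) = 0
        u = u + h * (-u)                     # Euler step
    return xs, errs, es

def remainder(h):
    """||u - y - h e||_inf, which (5.105) predicts to be O(h^2) when p = 1."""
    _, errs, es = euler_global_error(h)
    return max(abs(d - h * e) for d, e in zip(errs, es))
```

Halving h should reduce `remainder(h)` by a factor of about 4, while the global
error itself is only halved.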
5.8 Error Monitoring and Step Control


Most production codes currently available for solving ODEs monitor local
truncation errors and control the step length on the basis of estimates for these
errors. Here, we attempt to monitor the global error, at least asymptotically,
by implementing the asymptotic result of Theorem 5.7.3. This necessitates the
evaluation of the Jacobian matrix $f_y(x,y)$ along or near the solution trajectory;
but this is only natural, since $f_y$, in a first approximation, governs the effect
of perturbations via the variational differential equation (5.103). This equation is
driven by the principal error function evaluated along the trajectory, so that estimates
of local truncation errors (more precisely, of the principal error function) are needed
also in this approach. For simplicity, we again assume constant grid length.


5.8.1 Estimation of Global Error 


The idea of our estimation is to integrate the “variational equation” (5.103) along
with the main equation (5.102). Since we need $e(x_n)$ in (5.104) only to within an
accuracy of $O(h)$ (any $O(h)$ error term in $e(x_n)$, multiplied by $h^p$, being absorbed
by the $O(h^{p+1})$ term), we can use Euler's method for that purpose, which will
provide the desired approximation $v_n \approx e(x_n)$.


Theorem 5.8.1. Assume that

(1) $\Phi(x,y;h) \in C^2$ on $[a,b] \times \mathbb{R}^d \times [0,h_0]$;

(2) $\Phi$ is a method of order $p \ge 1$ admitting a principal error function $\tau(x,y) \in C^1$
on $[a,b] \times \mathbb{R}^d$;

(3) an estimate $r(x,y;h)$ is available for the principal error function that satisfies

$$r(x,y;h) = \tau(x,y) + O(h), \quad h \to 0, \tag{5.113}$$

uniformly on $[a,b] \times \mathbb{R}^d$;
(4) along with the grid function $u = \{u_n\}$ we generate the grid function $v = \{v_n\}$
in the following manner,

$$\begin{aligned}
x_{n+1} &= x_n + h,\\
u_{n+1} &= u_n + h\Phi(x_n, u_n; h),\\
v_{n+1} &= v_n + h\,[f_y(x_n, u_n)\,v_n + r(x_n, u_n; h)],\\
x_0 &= a, \quad u_0 = y_0, \quad v_0 = 0.
\end{aligned} \tag{5.114}$$

Then, for $n = 0, 1, \ldots, N$,

$$u_n - y(x_n) = v_nh^p + O(h^{p+1}) \quad \text{as } h \to 0. \tag{5.115}$$
Proof. The proof consists of establishing the following estimates,

$$f_y(x_n, u_n) = f_y(x_n, y(x_n)) + O(h), \tag{5.116}$$

$$r(x_n, u_n; h) = \tau(x_n, y(x_n)) + O(h). \tag{5.117}$$

Once this has been done, we can argue as follows. Let (cf. (5.114))

$$g(x,y) := f_y(x, y(x))\,y + \tau(x, y(x)). \tag{5.118}$$

The equation for $v_{n+1}$ in (5.114) has the form

$$v_{n+1} = v_n + h(A_nv_n + b_n),$$

where $A_n$ are bounded matrices and $b_n$ bounded vectors. By Lemma 5.7.2, we have
boundedness of $v_n$,

$$v_n = O(1), \quad h \to 0. \tag{5.119}$$

Substituting (5.116) and (5.117) into the equation for $v_{n+1}$, and noting (5.119), we
obtain

$$\begin{aligned}
v_{n+1} &= v_n + h\,[f_y(x_n, y(x_n))\,v_n + \tau(x_n, y(x_n)) + O(h)]\\
 &= v_n + h\,g(x_n, v_n) + O(h^2).
\end{aligned}$$

Thus, in the notation used in the proof of Theorem 5.7.3,

$$\left(R_h^{\mathrm{Euler},g}\,v\right)_n = O(h), \quad v_0 = 0.$$

Since Euler's method is stable, we conclude

$$v_n - e(x_n) = O(h),$$

where $e(x)$ is, as before, the solution of $e' = g(x,e)$, $e(a) = 0$. Therefore, by
(5.104),

$$u_n - y(x_n) = e(x_n)h^p + O(h^{p+1}) = v_nh^p + O(h^{p+1}),$$

as was to be shown.

It remains to prove (5.116) and (5.117). From assumption (1) we note, first of all,
that $f(x,y) \in C^2$ on $[a,b] \times \mathbb{R}^d$, since by consistency $f(x,y) = \Phi(x,y;0)$. By
virtue of $u_n = y(x_n) + O(h^p)$ (cf. Theorem 5.7.2), we therefore have

$$f_y(x_n, u_n) = f_y(x_n, y(x_n)) + O(h^p),$$

which implies (5.116), since $p \ge 1$.

Next, since $\tau(x,y) \in C^1$ by assumption (2), we have

$$\begin{aligned}
\tau(x_n, u_n) &= \tau(x_n, y(x_n)) + \tau_y(x_n, \bar{u}_n)\,(u_n - y(x_n))\\
 &= \tau(x_n, y(x_n)) + O(h^p),
\end{aligned}$$

so that by assumption (3),

$$r(x_n, u_n; h) = \tau(x_n, u_n) + O(h) = \tau(x_n, y(x_n)) + O(h^p) + O(h),$$

from which (5.117) follows at once. This completes the proof of Theorem 5.8.1. □
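
The recursion (5.114) is straightforward to program. The following sketch is
written for scalar problems and takes the increment function, the Jacobian f_y,
and the estimator r as user-supplied callables; these names, and the
demonstration with Euler's method on y' = -y (where the principal error
function -(f_x + f_y f)/2 = -y/2 is used directly as r), are assumptions made
for illustration.

```python
import math

def solve_with_error_estimate(phi, f_y, r, a, b, y0, n):
    """Run (5.114): u by the one-step method phi, v by Euler on the variational equation."""
    h = (b - a) / n
    x, u, v = a, y0, 0.0
    for _ in range(n):
        # both updates use the old values (x_n, u_n, v_n)
        u_new = u + h * phi(x, u, h)
        v = v + h * (f_y(x, u) * v + r(x, u, h))
        x, u = x + h, u_new
    return u, v, h

# Demonstration: Euler's method (p = 1) on y' = -y, y(0) = 1 over [0, 1]
p = 1
u_N, v_N, h = solve_with_error_estimate(
    phi=lambda x, y, h: -y,          # Euler: Phi = f
    f_y=lambda x, y: -1.0,           # Jacobian of f(x, y) = -y
    r=lambda x, y, h: -0.5 * y,      # principal error function of Euler for this f
    a=0.0, b=1.0, y0=1.0, n=100)

true_error = u_N - math.exp(-1.0)
estimated_error = v_N * h**p         # leading term v_N h^p of (5.115)
```

In this example `estimated_error` reproduces `true_error` to within about one
part in a thousand, consistent with the O(h^{p+1}) remainder in (5.115).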


5.8.2 Truncation Error Estimates


In order to apply Theorem 5.8.1, we need estimates $r(x,y;h)$ of the principal error
function $\tau(x,y)$ which are $O(h)$ accurate. A number of them, in increasing order
of efficiency, are now described.


1. Local Richardson extrapolation to zero: This works for any one-step method $\Phi$,
but is usually considered to be too expensive. If $\Phi$ has order $p$, the procedure is
as follows.

$$\begin{aligned}
y_h &= y + h\Phi(x, y; h),\\
y_{h/2} &= y + \tfrac{1}{2}h\,\Phi\left(x, y; \tfrac{1}{2}h\right),\\
y_h^* &= y_{h/2} + \tfrac{1}{2}h\,\Phi\left(x + \tfrac{1}{2}h,\; y_{h/2}; \tfrac{1}{2}h\right),\\
r(x,y;h) &= \frac{1}{1 - 2^{-p}}\,\frac{1}{h^{p+1}}\,(y_h - y_h^*).
\end{aligned} \tag{5.120}$$

Note that $y_h^*$ is the result of applying $\Phi$ over two consecutive steps of length $\tfrac{1}{2}h$
each, whereas $y_h$ is the result of one application over the whole step of length $h$.

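We now verify that $r(x,y;h)$ in (5.120) is an acceptable estimator. To do this,
we need to assume that $\tau(x,y) \in C^1$ on $[a,b] \times \mathbb{R}^d$. In terms of the reference
solution $u(t)$ through $(x,y)$, we have (cf. (5.32) and (5.37))

$$\Phi(x,y;h) = \frac{1}{h}\,[u(x+h) - u(x)] + \tau(x,y)h^p + O(h^{p+1}). \tag{5.121}$$

Furthermore,

$$\begin{aligned}
\frac{1}{h}\,(y_h - y_h^*) &= \frac{1}{h}\,(y - y_{h/2}) + \Phi(x,y;h) - \frac{1}{2}\Phi\left(x + \tfrac{1}{2}h,\; y_{h/2}; \tfrac{1}{2}h\right)\\
 &= \Phi(x,y;h) - \frac{1}{2}\Phi\left(x, y; \tfrac{1}{2}h\right) - \frac{1}{2}\Phi\left(x + \tfrac{1}{2}h,\; y_{h/2}; \tfrac{1}{2}h\right).
\end{aligned}$$

Applying (5.121) to each of the three terms on the right, we find

$$\begin{aligned}
\frac{1}{h}\,(y_h - y_h^*) ={}& \frac{1}{h}\,[u(x+h) - u(x)] + \tau(x,y)h^p + O(h^{p+1})\\
 &- \frac{1}{2}\,\frac{1}{h/2}\left[u\left(x + \tfrac{1}{2}h\right) - u(x)\right] - \frac{1}{2}\tau(x,y)\left(\tfrac{1}{2}h\right)^p + O(h^{p+1})\\
 &- \frac{1}{2}\,\frac{1}{h/2}\left[u(x+h) - u\left(x + \tfrac{1}{2}h\right)\right]
 - \frac{1}{2}\tau\left(x + \tfrac{1}{2}h,\; y + O(h)\right)\left(\tfrac{1}{2}h\right)^p + O(h^{p+1})\\
 ={}& \tau(x,y)(1 - 2^{-p})h^p + O(h^{p+1}).
\end{aligned}$$

Consequently,

$$\frac{1}{1 - 2^{-p}}\,\frac{1}{h}\,(y_h - y_h^*) = \tau(x,y)h^p + O(h^{p+1}), \tag{5.122}$$

as required.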

Subtracting (5.122) from (5.121) shows, incidentally, that

$$\Phi^*(x,y;h) = \Phi(x,y;h) - \frac{1}{1 - 2^{-p}}\,\frac{1}{h}\,(y_h - y_h^*) \tag{5.123}$$

defines a one-step method of order $p + 1$.

The procedure in (5.120) is rather expensive. For a fourth-order Runge-Kutta
process, it requires a total of 11 evaluations of $f$ per step, almost three times
the effort for a single Runge-Kutta step. Therefore, Richardson extrapolation is
normally used only after every two steps of $\Phi$; that is, one proceeds according to

$$\begin{aligned}
y_h &= y + h\Phi(x, y; h),\\
y_{2h}^* &= y_h + h\Phi(x + h, y_h; h),\\
y_{2h} &= y + 2h\Phi(x, y; 2h).
\end{aligned} \tag{5.124}$$

Then (5.122) gives

$$\frac{1}{2(2^p - 1)}\,\frac{1}{h^{p+1}}\,(y_{2h} - y_{2h}^*) = \tau(x,y) + O(h), \tag{5.125}$$

so that the expression on the left is an acceptable estimator $r(x,y;h)$. If the two
steps in (5.124) yield acceptable accuracy (cf. Sect. 5.8.3), then, again for a fourth-order
Runge-Kutta process, the procedure requires only three additional evaluations
of $f$, since $y_h$ and $y_{2h}^*$ would have to be computed anyhow. We show, however, that
there are still more efficient schemes.
2. Embedded methods: The basic idea of this approach is very simple: if the given
method $\Phi$ has order $p$, take any one-step method $\Phi^*$ of order $p^* = p + 1$ and
define

$$r(x,y;h) = \frac{1}{h^p}\,[\Phi(x,y;h) - \Phi^*(x,y;h)]. \tag{5.126}$$

This is indeed an acceptable estimator, as follows by subtracting the two relations

$$\begin{aligned}
\Phi(x,y;h) - \frac{1}{h}\,[u(x+h) - u(x)] &= \tau(x,y)h^p + O(h^{p+1}),\\
\Phi^*(x,y;h) - \frac{1}{h}\,[u(x+h) - u(x)] &= O(h^{p+1})
\end{aligned}$$

and dividing the result by $h^p$.

The tricky part is making this procedure efficient. Following an idea of Fehlberg,
one can try to do this by embedding one Runge-Kutta process (of order $p$) into
another (of order $p + 1$). Specifically, let $\Phi$ be some explicit $r$-stage Runge-Kutta
method,

$$\begin{aligned}
k_1(x,y) &= f(x,y),\\
k_s(x,y;h) &= f\Big(x + \mu_sh,\; y + h\sum_{j=1}^{s-1}\lambda_{sj}k_j\Big), \quad s = 2, 3, \ldots, r,\\
\Phi(x,y;h) &= \sum_{s=1}^{r}\alpha_sk_s.
\end{aligned}$$

Then for $\Phi^*$ choose a similar $r^*$-stage process, with $r^* > r$, in such a way that

$$\mu_s^* = \mu_s, \quad \lambda_{sj}^* = \lambda_{sj} \quad \text{for } s = 2, 3, \ldots, r.$$

The estimate (5.126) then costs only $r^* - r$ extra evaluations of $f$. If $r^* = r + 1$,
one might even attempt to save the additional evaluation by selecting (if possible)

$$\mu_{r^*}^* = 1, \quad \lambda_{r^*j}^* = \alpha_j \quad \text{for } j = 1, 2, \ldots, r^* - 1 \quad (r^* = r + 1). \tag{5.127}$$

Then indeed, $k_{r^*}^*$ will be identical with $k_1$ for the next step.

Pairs of such embedded $(p, p + 1)$ Runge-Kutta formulae have been developed
in the late 1960s by E. Fehlberg. There is a considerable degree of freedom in
choosing the parameters. Fehlberg's choices were guided by an attempt to reduce
the magnitude of the coefficients of all the partial derivative aggregates that enter
into the principal error function $\tau(x,y)$ of $\Phi$ (cf. the end of Sect. 5.6.4 for an
elementary example of this technique). He succeeded in obtaining pairs with the
following values of parameters $p$, $r$, $r^*$ (Table 5.1).

Table 5.1 Embedded Runge-Kutta formulae

  p     r     r*
  3     4     5
  4     5     6
  5     6     8
  6     8    10
  7    11    13
  8    15    17

For the third-order process (and only for that one), one can also arrange for (5.127)
to hold (cf. Ex. 15 for a second-order process).
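
Both estimators can be tried out on the simplest possible pair. In the sketch
below, the method is Euler's method (p = 1), and for the embedded variant we
take Heun's method (order 2) as the higher-order companion; it reuses the first
stage of Euler's method, so the pair costs one extra evaluation of f. This pair,
the right-hand side -2xy^2, the evaluation point, and all names are assumptions
made for illustration; they are not Fehlberg's formulae. The exact principal
error function of Euler's method, -(f_x + f_y f)/2, serves as the reference.

```python
def f(x, y):
    # hypothetical right-hand side y' = -2 x y^2
    return -2.0 * x * y * y

def euler_phi(x, y, h):
    """Euler's method, order p = 1."""
    return f(x, y)

def heun_phi(x, y, h):
    """Heun's method, order 2; its first stage coincides with Euler's."""
    k1 = f(x, y)
    return 0.5 * (k1 + f(x + h, y + h * k1))

def richardson_estimate(phi, p, x, y, h):
    """Estimator r(x, y; h) of (5.120): one step of length h versus two of length h/2."""
    y_h = y + h * phi(x, y, h)
    y_half = y + 0.5 * h * phi(x, y, 0.5 * h)
    y_star = y_half + 0.5 * h * phi(x + 0.5 * h, y_half, 0.5 * h)
    return (y_h - y_star) / ((1.0 - 2.0 ** (-p)) * h ** (p + 1))

def embedded_estimate(phi, phi_star, p, x, y, h):
    """Estimator r(x, y; h) of (5.126) from a method of order p and one of order p + 1."""
    return (phi(x, y, h) - phi_star(x, y, h)) / h ** p

def tau_euler(x, y):
    """Principal error function of Euler's method, -(f_x + f_y f) / 2, for this f."""
    f_x = -2.0 * y * y
    f_y = -4.0 * x * y
    return -0.5 * (f_x + f_y * f(x, y))

x, y, h = 0.5, 0.8, 1e-3
r_rich = richardson_estimate(euler_phi, 1, x, y, h)
r_emb = embedded_estimate(euler_phi, heun_phi, 1, x, y, h)
```

Both `r_rich` and `r_emb` differ from `tau_euler(x, y)` by terms of order h, as
required by (5.113).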


5.8.3 Step Control 


Any estimate $r(x,y;h)$ of the principal error function $\tau(x,y)$ implies an estimate

$$h^pr(x,y;h) = T(x,y;h) + O(h^{p+1}) \tag{5.128}$$


for the truncation error, which can be used to monitor the local truncation error 
during the integration process. However, one has to keep in mind that the local 
truncation error is quite different from the global error, the error that one really 
wants to control. To get more insight into the relationship between these two errors, 
we recall the following theorem, which quantifies the continuity of the solution of 
an initial value problem with respect to initial values. 


Theorem 5.8.2. Let $f(x,y)$ be continuous in $x$ for $a \le x \le b$ and satisfy a
Lipschitz condition uniformly on $[a,b] \times \mathbb{R}^d$ with Lipschitz constant $L$ (cf. (5.27)).
Then the initial value problem

$$\frac{dy}{dx} = f(x,y), \quad a \le x \le b; \qquad y(c) = y_c, \tag{5.129}$$

has a unique solution on $[a,b]$ for any $c$ with $a \le c \le b$ and for any $y_c \in \mathbb{R}^d$.
Let $y(x;s)$ and $y(x;s^*)$ be the solutions of (5.129) corresponding to $y_c = s$ and
$y_c = s^*$, respectively. Then, for any vector norm $\|\cdot\|$,

$$\|y(x;s) - y(x;s^*)\| \le e^{L|x-c|}\,\|s - s^*\|. \tag{5.130}$$


Solving the given initial value problem (5.102) numerically by a one-step method
(not necessarily with constant step) in reality means that one follows a sequence of
“solution tracks,” whereby at each grid point $x_n$ one jumps from one track to the
next by an amount determined by the truncation error at $x_n$ (cf. Fig. 5.2). This is
so by the very definition of truncation error, the reference solution being one of the
solution tracks.

Fig. 5.2 Error accumulation in a one-step method

Specifically, the $n$th track, $n = 0, 1, \ldots, N$, is given by the solution
of the initial value problem

$$\frac{dv_n}{dx} = f(x, v_n), \quad x_n \le x \le b; \qquad v_n(x_n) = u_n, \tag{5.131}$$

and

$$u_{n+1} = v_n(x_{n+1}) + h_nT(x_n, u_n; h_n), \quad n = 0, 1, \ldots, N-1. \tag{5.132}$$


Since by (5.131) we have $u_{n+1} = v_{n+1}(x_{n+1})$, we can apply Theorem 5.8.2 to the
solutions $v_{n+1}$ and $v_n$, letting $c = x_{n+1}$, $s = u_{n+1}$, $s^* = u_{n+1} - h_nT(x_n, u_n; h_n)$
(by (5.132)), and thus obtain

$$\|v_{n+1}(x) - v_n(x)\| \le h_ne^{L|x - x_{n+1}|}\,\|T(x_n, u_n; h_n)\|, \quad n = 0, 1, \ldots, N-1. \tag{5.133}$$

Now

$$\sum_{n=0}^{N-1}[v_{n+1}(x) - v_n(x)] = v_N(x) - v_0(x) = v_N(x) - y(x), \tag{5.134}$$

and since $v_N(x_N) = u_N$, letting $x = x_N$, we get from (5.133) and (5.134) that

$$\begin{aligned}
\|u_N - y(x_N)\| &\le \sum_{n=0}^{N-1}\|v_{n+1}(x_N) - v_n(x_N)\|\\
 &\le \sum_{n=0}^{N-1}h_ne^{L|x_N - x_{n+1}|}\,\|T(x_n, u_n; h_n)\|.
\end{aligned}$$
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Therefore, if we make sure that 
[? Gatti) | Ser, HHL 2g NHL (5.135) 


then 
N-1 
lew — yCxw)Il Ser Denti — Xnerhy etl, 


n=0 


Interpreting the sum on the right as a Riemann sum for a definite integral, we finally 
obtain, approximately, 


b 
- er 4 
luv — yw) ser f ebO-Yax =F (eho — 1), 


a 


Thus, knowing an estimate for $L$ would allow us to set an appropriate $\varepsilon_T$,
namely,

$$\varepsilon_T = \frac{L}{e^{L(b-a)} - 1}\,\varepsilon, \tag{5.136}$$

to guarantee an error $\|u_N - y(x_N)\| \le \varepsilon$. What holds for the whole grid on $[a, b]$, of
course, holds for any grid on a subinterval $[a, x]$, $a < x \le b$. So, in principle, given
the desired accuracy $\varepsilon$ for the solution $y(x)$, we can determine a “local tolerance
level” $\varepsilon_T$ by (5.136) and achieve the desired accuracy by keeping the local truncation
error below $\varepsilon_T$ (cf. (5.135)). Note that as $L \to 0$ we have $\varepsilon_T \to \varepsilon/(b - a)$. This
limit value of $\varepsilon_T$ would be appropriate for a quadrature problem but definitely not
for a true differential equation problem, where $\varepsilon_T$, in general, has to be chosen
considerably smaller than the target error tolerance $\varepsilon$.
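As a small illustration of (5.136), the following Python sketch computes the local tolerance level from a target accuracy and an estimate of the Lipschitz constant. The function name `local_tolerance` and the sample values are assumptions made for this example only; the limit case $L \to 0$ is treated separately.

```python
import math

def local_tolerance(eps, L, a, b):
    """Local tolerance eps_T of (5.136): eps_T = L / (exp(L*(b-a)) - 1) * eps."""
    if L == 0.0:
        # Limit L -> 0 (pure quadrature problem): eps_T = eps / (b - a).
        return eps / (b - a)
    # expm1 avoids cancellation in exp(x) - 1 when L*(b-a) is small.
    return L / math.expm1(L * (b - a)) * eps

# Hypothetical sample values: target accuracy 1e-6 on [0, 1].
for L in (0.0, 1.0, 10.0):
    print(L, local_tolerance(1e-6, L, 0.0, 1.0))
```

The larger the Lipschitz constant, the smaller the local tolerance must be relative to the target accuracy, in line with the remark above.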

Considerations such as these motivate the following step control mechanism:
each integration step (from $x_n$ to $x_{n+1} = x_n + h_n$) consists of these parts:

1. Estimate $h_n$.

2. Compute $u_{n+1} = u_n + h_n \Phi(x_n, u_n; h_n)$ and $r(x_n, u_n; h_n)$.

3. Test $h_n^p \|r(x_n, u_n; h_n)\| \le \varepsilon_T$ (cf. (5.128) and (5.135)). If the test passes, proceed
with the next step; if not, repeat the step with a smaller $h_n$, say, half as large, until
the test passes.


To estimate $h_n$, assume first that $n \ge 1$, so that the estimator from the previous
step, $r(x_{n-1}, u_{n-1}; h_{n-1})$ (or at least its norm), is available. Then, neglecting terms
of $O(h)$,

$$\|\tau(x_{n-1}, u_{n-1})\| \approx \|r(x_{n-1}, u_{n-1}; h_{n-1})\|,$$

and since $\tau(x_n, u_n) \approx \tau(x_{n-1}, u_{n-1})$, likewise

$$\|\tau(x_n, u_n)\| \approx \|r(x_{n-1}, u_{n-1}; h_{n-1})\|.$$

What we want is

$$\|\tau(x_n, u_n)\|\, h_n^p \approx \theta \varepsilon_T,$$

where $\theta$ is a “safety factor,” say, $\theta = 0.8$. Eliminating $\tau(x_n, u_n)$, we find

$$h_n \approx \left\{ \frac{\theta \varepsilon_T}{\|r(x_{n-1}, u_{n-1}; h_{n-1})\|} \right\}^{1/p}.$$

Note that from the previous step we have $h_{n-1}^p \|r(x_{n-1}, u_{n-1}; h_{n-1})\| \le \varepsilon_T$, so that

$$h_n \ge \theta^{1/p} h_{n-1},$$

and the tendency is toward increasing the step.

If $n = 0$, we proceed similarly, using some initial guess $h_0^{(0)}$ of $h_0$ and associated
$r(x_0, y_0; h_0^{(0)})$ to obtain

$$h_0^{(1)} = \left\{ \frac{\theta \varepsilon_T}{\|r(x_0, y_0; h_0^{(0)})\|} \right\}^{1/p}.$$

The process may be repeated once or twice to get the final estimate of $h_0$ and
$\|r(x_0, y_0; h_0)\|$.
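The three-part step control just described can be sketched in Python for a scalar problem. This is a minimal sketch under stated assumptions, not a production code: the basic method is Euler's ($p = 1$), and the estimator $r$ is obtained by comparing Euler's increment with that of Heun's second-order method, which is only one possible choice of estimator. The names `euler_with_estimate` and `integrate`, and the test problem $y' = -y$, are hypothetical.

```python
import math

def euler_with_estimate(f, x, u, h):
    """One Euler step (p = 1) and an estimate r with T(x, u; h) ~ h**p * r.
    Assumption: T is estimated as Phi_Euler - Phi_Heun."""
    k1 = f(x, u)
    k2 = f(x + h, u + h * k1)
    T_est = 0.5 * (k1 - k2)          # k1 - (k1 + k2)/2
    return u + h * k1, T_est / h     # r = T / h**p with p = 1

def integrate(f, a, b, y0, eps_T, h_guess=0.1, theta=0.8, p=1):
    x, u, steps = a, y0, 0
    # n = 0: refine the initial guess of h_0 twice.
    h = h_guess
    for _ in range(2):
        _, r = euler_with_estimate(f, x, u, h)
        if r != 0.0:
            h = (theta * eps_T / abs(r)) ** (1.0 / p)
    while x < b:
        h = min(h, b - x)                       # do not step past b
        while True:
            u_next, r = euler_with_estimate(f, x, u, h)
            if h ** p * abs(r) <= eps_T:        # part 3: acceptance test
                break
            h *= 0.5                            # reject: halve and retry
        x = b if h >= b - x else x + h
        u = u_next
        steps += 1
        if r != 0.0:                            # part 1 for the next step
            h = (theta * eps_T / abs(r)) ** (1.0 / p)
    return u, steps

# Hypothetical test problem y' = -y, y(0) = 1 on [0, 1], Lipschitz constant L = 1.
L, a, b, eps = 1.0, 0.0, 1.0, 1e-2
eps_T = L / math.expm1(L * (b - a)) * eps       # local tolerance (5.136)
u_N, steps = integrate(lambda x, y: -y, a, b, 1.0, eps_T)
print(steps, abs(u_N - math.exp(-1.0)))
```

For systems, the absolute values would be replaced by a vector norm; the structure of the loop stays the same.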


5.9 Stiff Problems 


Although there is no generally accepted definition of stiffness⁷ of differential
equations, a characteristic feature of stiffness is the presence of rapidly changing
transients. This manifests itself mathematically in the Jacobian matrix $f_y$ having
eigenvalues with very large negative real parts along with others of normal
magnitude. Standard (in particular explicit) numerical ODE methods are unable
to cope with such solutions unless they use unrealistically small step lengths.
What is called for are methods enjoying a special stability property called A-stability.
We introduce this concept in the context of linear homogeneous systems
of differential equations with constant coefficient matrix. Padé approximants to the
exponential function turn out to be instrumental in constructing A-stable one-step
methods.


⁷ The word “stiffness” comes from the differential equation governing the oscillation of a “stiff”
spring, that is, a spring with a large spring constant.

5.9.1 A-Stability


A model problem exhibiting stiffness is the linear initial value problem

$$\frac{dy}{dx} = Ay, \quad a \le x < \infty; \qquad y(a) = y_0, \tag{5.137}$$

where $A \in \mathbb{R}^{d \times d}$ is a constant matrix of order $d$ having all its eigenvalues in the
left half-plane,

$$\mathrm{Re}\, \lambda_i(A) < 0, \quad i = 1, 2, \ldots, d. \tag{5.138}$$

It is well known that all solutions of the differential system in (5.137) then decay
exponentially as $x \to \infty$. Those corresponding to eigenvalues with very large
negative real parts do so particularly fast, giving rise to the phenomenon of stiffness.
In particular, for the solution $y(x)$ of (5.137), we have

$$y(x) \to 0 \quad \text{as } x \to \infty. \tag{5.139}$$


How does a one-step method $\Phi$ behave when applied to (5.137)? First of all, a
generic step of the one-step method will now have the form

$$y_{\text{next}} = y + h\Phi(x, y; h) = \varphi(hA)\,y, \tag{5.140}$$

where $\varphi$ is some function, called the stability function of the method. In what follows
we assume that the matrix function $\varphi(hA)$ is well defined; minimally, we require
that $\varphi: \mathbb{C} \to \mathbb{C}$ is analytic in a neighborhood of the origin. Since the reference
solution through the point $(x, y)$ is given by $u(t) = e^{A(t - x)} y$, we have for the
truncation error of $\Phi$ at $(x, y)$ (cf. (5.31))

$$T(x, y; h) = \frac{1}{h}\,[y_{\text{next}} - u(x + h)] = \frac{1}{h}\,[\varphi(hA) - e^{hA}]\,y. \tag{5.141}$$

In particular, the method $\Phi$ in this case has order $p$ if and only if

$$e^z = \varphi(z) + O(z^{p+1}), \quad z \to 0. \tag{5.142}$$

This shows the relevance of approximations to the exponential function in the
context of one-step methods applied to the model problem (5.137).
The approximate solution $u = \{u_n\}$ to the initial value problem (5.137),
assuming for simplicity a constant grid length $h$, is given by

$$u_{n+1} = \varphi(hA)\,u_n, \quad n = 0, 1, 2, \ldots; \qquad u_0 = y_0;$$

hence

$$u_n = [\varphi(hA)]^n y_0, \quad n = 0, 1, 2, \ldots. \tag{5.143}$$

This will simulate the behavior (5.139) of the exact solution if and only if

$$\lim_{n \to \infty} [\varphi(hA)]^n = 0. \tag{5.144}$$

A necessary and sufficient condition for (5.144) to hold is that all eigenvalues of the
matrix $\varphi(hA)$ be strictly within the unit circle. This in turn is equivalent to

$$|\varphi(h\lambda_i(A))| < 1 \quad \text{for } i = 1, 2, \ldots, d, \tag{5.145}$$

where $\lambda_i(A)$ are the eigenvalues of $A$. In view of (5.138), this gives rise to the
following definition.


Definition 5.9.1. A one-step method $\Phi$ is called A-stable if the function $\varphi$ associated
with $\Phi$ according to (5.140) is defined in the left half of the complex plane and
satisfies

$$|\varphi(z)| < 1 \quad \text{for all } z \text{ with } \mathrm{Re}\, z < 0. \tag{5.146}$$
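Condition (5.145) is easy to test numerically once the stability function is known. The following Python sketch does so for two stability functions that appear later in this section, $1 + z$ for Euler's method (cf. (5.172)) and $1/(1 - z)$ for the implicit Euler method (cf. (5.158)). The spectrum and the step lengths are hypothetical values chosen for illustration.

```python
def decays(phi, h, eigenvalues):
    """Check condition (5.145): |phi(h * lambda_i)| < 1 for every eigenvalue."""
    return all(abs(phi(h * lam)) < 1 for lam in eigenvalues)

def phi_explicit_euler(z):
    return 1 + z            # cf. (5.172)

def phi_implicit_euler(z):
    return 1 / (1 - z)      # cf. (5.158)

eigs = [-1.0, -1000.0]      # hypothetical stiff spectrum
for h in (0.01, 0.001):
    print(h, decays(phi_explicit_euler, h, eigs), decays(phi_implicit_euler, h, eigs))
```

With $h = 0.01$ the fast eigenvalue gives $|1 + h\lambda| = 9$ for Euler's method, so the step length has to be reduced until $h\lambda$ lies inside the unit disk centered at $-1$; the implicit Euler method satisfies the condition for every $h > 0$.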


We are led to the problem of constructing a function $\varphi$ (and with it, a one-step
method $\Phi$), which is analytic in the left half-plane, approximates well the
exponential function near the origin (cf. (5.142)), and satisfies (5.146). An important
tool for this is Padé approximation to the exponential function.


5.9.2 Padé⁸ Approximation


For any function $g(z)$ analytic in a neighborhood of $z = 0$, one defines its Padé
approximants as follows.


Definition 5.9.2. The Padé approximant $R[n, m](z)$ to the function $g(z)$ is the
rational function

$$R[n, m](z) = \frac{P(z)}{Q(z)}, \quad P \in \mathbb{P}_m, \quad Q \in \mathbb{P}_n, \tag{5.147}$$

satisfying

$$g(z)Q(z) - P(z) = O(z^{n+m+1}) \quad \text{as } z \to 0. \tag{5.148}$$


⁸ Henri Eugène Padé (1863-1953), a French mathematician, was educated partly in Germany and
partly in France, where he wrote his thesis under Hermite’s supervision. Although much of his
time was consumed by high administrative duties, he managed to write many papers on continued
fractions and rational approximation. His thesis and related papers became widely known after
Borel referred to them in his 1901 book on divergent series.

Thus, when expanding the left-hand side of (5.148) in powers of $z$, all initial
terms should drop out up to (and including) the one with power $z^{n+m}$. It is known
that the rational function $R[n, m]$ is uniquely determined by this definition, even
though in exceptional cases $P$ and $Q$ may have common factors. If this is not the
case, that is, $P/Q$ is irreducible over the complex numbers, we assume without
loss of generality that $Q(0) = 1$.

Our interest here is in the function $g(z) = e^z$. In this case, $P = P[n, m]$ and
$Q = Q[n, m]$ in (5.147) and (5.148) can be explicitly determined.


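Theorem 5.9.1. The Padé approximant $R[n, m]$ to the exponential function $g(z) = e^z$
is given by

$$P[n, m](z) = \sum_{k=0}^{m} \frac{m!\,(n + m - k)!}{(m - k)!\,(n + m)!}\,\frac{z^k}{k!}, \tag{5.149}$$

$$Q[n, m](z) = \sum_{k=0}^{n} (-1)^k \frac{n!\,(n + m - k)!}{(n - k)!\,(n + m)!}\,\frac{z^k}{k!}. \tag{5.150}$$

Moreover,

$$e^z - \frac{P[n, m](z)}{Q[n, m](z)} = C_{n,m}\, z^{n+m+1} + \cdots,$$

where

$$C_{n,m} = (-1)^n \frac{n!\,m!}{(n + m)!\,(n + m + 1)!}. \tag{5.151}$$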
Proof. Let

$$v(t) := t^n (1 - t)^m.$$

By Leibniz’s rule, one finds

$$v^{(r)}(t) = \sum_{k=0}^{r} \binom{r}{k} [t^n]^{(k)} [(1 - t)^m]^{(r-k)}
= \sum_{k=0}^{r} \binom{r}{k} \binom{n}{k} k!\, t^{n-k} \binom{m}{r-k} (r - k)!\, (-1)^{r-k} (1 - t)^{m-r+k};$$

hence, in particular,

$$v^{(r)}(0) = (-1)^{r-n} \binom{m}{r-n} r! \ \text{ if } r \ge n; \qquad v^{(r)}(0) = 0 \ \text{ if } r < n;$$

$$v^{(r)}(1) = (-1)^{m} \binom{n}{r-m} r! \ \text{ if } r \ge m; \qquad v^{(r)}(1) = 0 \ \text{ if } r < m. \tag{5.152}$$

Given any integer $q \ge 0$, repeated integration by parts yields

$$\int_0^1 e^{tz} v(t)\,dt = e^z \sum_{r=0}^{q} (-1)^r \frac{v^{(r)}(1)}{z^{r+1}} - \sum_{r=0}^{q} (-1)^r \frac{v^{(r)}(0)}{z^{r+1}} + \frac{(-1)^{q+1}}{z^{q+1}} \int_0^1 e^{tz} v^{(q+1)}(t)\,dt. \tag{5.153}$$

Putting here $q = n + m$, so that $v^{(q+1)}(t) \equiv 0$, and multiplying by $(-1)^q z^{q+1}$
gives

$$e^z \sum_{r=0}^{q} (-1)^{q-r} v^{(r)}(1)\, z^{q-r} = \sum_{r=0}^{q} (-1)^{q-r} v^{(r)}(0)\, z^{q-r} + O(z^{n+m+1}), \quad z \to 0,$$

where the $O$-term comes from multiplying the integral on the left of (5.153) (which
is $O(1)$ as $z \to 0$) by $z^{q+1} = z^{n+m+1}$. In the first sum it suffices, by (5.152), to
sum over $r \ge m$, and in the second, to sum over $r \ge n$. Using the new variable of
summation $k$ defined by $q - r = k$, we find

$$e^z \sum_{k=0}^{n} (-1)^k v^{(n+m-k)}(1)\, z^k - \sum_{k=0}^{m} (-1)^k v^{(n+m-k)}(0)\, z^k = O(z^{n+m+1}),$$

which clearly is (5.148) for $g(z) = e^z$. It now suffices to substitute the values (5.152)
for the derivatives of $v$ at 0 and 1 to obtain (5.149) and (5.150), after multiplication
of numerator and denominator by $(-1)^m/(n + m)!$. Tracing the constants, one readily
checks (5.151). □
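The formulae of Theorem 5.9.1 can be checked in exact rational arithmetic. The Python sketch below builds the coefficients of $P[n, m]$ and $Q[n, m]$ from (5.149) and (5.150), expands $e^z Q(z) - P(z)$ in powers of $z$, and compares the first nonvanishing coefficient with $C_{n,m}$ of (5.151) (which is legitimate since $Q(0) = 1$). The function names are assumptions of this example.

```python
from fractions import Fraction
from math import factorial

def pade_exp(n, m):
    """Coefficients (lowest power first) of P[n,m] (degree m) and Q[n,m] (degree n)."""
    P = [Fraction(factorial(m) * factorial(n + m - k),
                  factorial(m - k) * factorial(n + m) * factorial(k))
         for k in range(m + 1)]
    Q = [Fraction((-1) ** k * factorial(n) * factorial(n + m - k),
                  factorial(n - k) * factorial(n + m) * factorial(k))
         for k in range(n + 1)]
    return P, Q

def defect(n, m, terms):
    """First `terms` Taylor coefficients of e^z * Q(z) - P(z)."""
    P, Q = pade_exp(n, m)
    coeffs = []
    for j in range(terms):
        # Cauchy product of the exponential series with Q.
        c = sum(Q[k] / factorial(j - k) for k in range(min(j, n) + 1))
        if j <= m:
            c -= P[j]
        coeffs.append(c)
    return coeffs

def error_constant(n, m):
    """C_{n,m} of (5.151)."""
    return Fraction((-1) ** n * factorial(n) * factorial(m),
                    factorial(n + m) * factorial(n + m + 1))

d = defect(2, 1, 5)
print(d, error_constant(2, 1))
```

All coefficients up to and including that of $z^{n+m}$ vanish, as required by (5.148).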


The Padé approximants to the exponential function have some very useful and
important properties. Here are those of interest in connection with A-stability:


1. $P[n, m](z) = Q[m, n](-z)$: The numerator polynomial is the denominator
polynomial with indices interchanged and $z$ replaced by $-z$. This reflects the
property $1/e^z = e^{-z}$ of the exponential function. The proof follows immediately
from (5.149) and (5.150).

2. For each $n = 0, 1, 2, \ldots$, all zeros of $Q[n, n]$ have positive real parts (hence,
by (1), all zeros of $P[n, n]$ have negative real parts). A proof can be given by
applying the Routh–Hurwitz criterion⁹ for stable polynomials.


⁹ The Routh–Hurwitz criterion states that a real polynomial $a_0 x^n + a_1 x^{n-1} + \cdots + a_n$, $a_0 > 0$,
has all its zeros in the left half of the complex plane if and only if all leading principal minors
of the $n$th-order Hurwitz matrix $H$ are positive. Here the elements in the $i$th row of $H$ are
$a_{2-i}, a_{4-i}, \ldots, a_{2n-i}$ (where $a_k = 0$ if $k < 0$ or $k > n$).

3. For real $t \in \mathbb{R}$, and $n = 0, 1, 2, \ldots$, there holds

$$\left| \frac{P[n, n](it)}{Q[n, n](it)} \right| = 1.$$

Indeed, by property (1), one has $P[n, n](it) = \overline{Q[n, n](it)}$.
4. There holds

$$\left| \frac{P[n+1, n](it)}{Q[n+1, n](it)} \right| < 1 \quad \text{for } t \in \mathbb{R},\ t \ne 0,\ n = 0, 1, 2, \ldots.$$

The proof follows from the basic property (5.148) of the Padé approximant:

$$e^{it} Q[n+1, n](it) - P[n+1, n](it) = O(|t|^{2n+2}), \quad t \to 0.$$

Taking absolute values, and using the triangle inequality, gives

$$\big|\, |Q[n+1, n](it)| - |P[n+1, n](it)| \,\big| \le |e^{it} Q[n+1, n](it) - P[n+1, n](it)| = O(|t|^{2n+2});$$

that is,

$$|Q[n+1, n](it)| - |P[n+1, n](it)| = O(|t|^{2n+2}).$$

Multiply this by $|Q[n+1, n](it)| + |P[n+1, n](it)|$ to obtain

$$|Q[n+1, n](it)|^2 - |P[n+1, n](it)|^2 = O(t^{2n+2}), \quad t \to 0, \tag{5.154}$$

where the order term is unaffected since both $Q$ and $P$ have the value 1 at
$t = 0$ (cf. (5.149) and (5.150)). But $|P[n+1, n](it)|^2 = P[n+1, n](it) \cdot P[n+1, n](-it)$
is a polynomial of degree $n$ in $t^2$, and similarly,
$|Q[n+1, n](it)|^2$ a polynomial of degree $n + 1$ in $t^2$. Thus, (5.154) can hold
only if

$$|Q[n+1, n](it)|^2 - |P[n+1, n](it)|^2 = a\,t^{2n+2},$$

where $a\,t^{2n+2}$ is the leading term in $|Q[n+1, n](it)|^2$, hence $a > 0$. From this,
the assertion follows immediately.

5. For each $n = 0, 1, 2, \ldots$, all zeros of $Q[n+1, n]$ have positive real parts. We
sketch the proof. From (5.149) and (5.150), one notes that

$$Q[n+1, n](z) + P[n+1, n](-z) = 2\,Q[n+1, n+1](z).$$

We now use Rouché’s theorem to show that $Q[n+1, n]$ and $Q[n+1, n+1]$
have the same number of zeros with negative real part (namely, none according to
property (2)). For this, one must show that on the contour
$C_R$: $\{z = it,\ |t| \le R,\ t \ne 0\} \cup \{|z| = R,\ \mathrm{Re}\, z < 0\}$ (see Fig. 5.3) one has, when $R$ is sufficiently
large,

$$|P[n+1, n](-z)| < |Q[n+1, n](z)|, \qquad Q[n+1, n](0) \ne 0.$$

For $z = it$, since $P[n+1, n](-it) = \overline{P[n+1, n](it)}$, the inequality follows
from property (4). For $|z| = R$, the inequality holds for $R \to \infty$, since
$\deg P[n+1, n] < \deg Q[n+1, n]$. Finally, $Q[n+1, n](0) = 1$.

Note also from (4) that $Q[n+1, n]$ cannot have purely imaginary zeros, since
$|Q[n+1, n](it)| > |P[n+1, n](it)| \ge 0$ for $t \ne 0$.

Fig. 5.3. The contour $C_R$

6. A rational function $R$ satisfies $|R(z)| < 1$ for $\mathrm{Re}\, z < 0$ if and only if $R$ is analytic
in $\mathrm{Re}\, z < 0$ and $|R(z)| \le 1$ for $\mathrm{Re}\, z = 0$.

Necessity. If $|R(z)| < 1$ in $\mathrm{Re}\, z < 0$, there can be no pole in $\mathrm{Re}\, z \le 0$ or at
$z = \infty$. By continuity, therefore, $|R(z)| \le 1$ on $\mathrm{Re}\, z = 0$.

Sufficiency. $R$ must be analytic in $\mathrm{Re}\, z \le 0$ and at $z = \infty$. Clearly,
$\lim_{z \to \infty} |R(z)| \le 1$ and $|R(z)| \le 1$ on the imaginary axis. Then, by the maximum
principle, $|R(z)| < 1$ for $\mathrm{Re}\, z < 0$.


7. As a corollary of property (6), we state the following important properties.
For each $n = 0, 1, 2, \ldots$, there holds

$$\left| \frac{P[n, n](z)}{Q[n, n](z)} \right| < 1 \quad \text{for } \mathrm{Re}\, z < 0, \tag{5.155}$$

$$\left| \frac{P[n+1, n](z)}{Q[n+1, n](z)} \right| < 1 \quad \text{for } \mathrm{Re}\, z < 0. \tag{5.156}$$

The first of these inequalities follows from properties (1) through (3), the second
from properties (4) and (5).

Property (7) immediately yields the following theorem.

Theorem 5.9.2. If the function $\varphi$ associated with the one-step method $\Phi$ according
to (5.140) is either the Padé approximant $\varphi(z) = R[n, n](z)$ of $e^z$, or the Padé
approximant $\varphi(z) = R[n+1, n](z)$ of $e^z$, $n = 0, 1, 2, \ldots$, then the method $\Phi$ is
A-stable.
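Theorem 5.9.2 can be spot-checked numerically. The Python sketch below evaluates $|R[n, n](z)|$ and $|R[n+1, n](z)|$ at random points of the left half-plane, using the coefficients (5.149) and (5.150), and contrasts them with $R[0, 1](z) = 1 + z$. Such sampling is of course no proof; the sampling box, the sample size, and all function names are assumptions of this example.

```python
import random
from math import factorial

def pade_exp(n, m):
    """Float coefficients (lowest power first) of P[n,m] and Q[n,m], cf. (5.149)-(5.150)."""
    P = [factorial(m) * factorial(n + m - k)
         / (factorial(m - k) * factorial(n + m) * factorial(k))
         for k in range(m + 1)]
    Q = [(-1) ** k * factorial(n) * factorial(n + m - k)
         / (factorial(n - k) * factorial(n + m) * factorial(k))
         for k in range(n + 1)]
    return P, Q

def horner(coeffs, z):
    """Evaluate a polynomial given by its coefficients, lowest power first."""
    value = 0.0
    for c in reversed(coeffs):
        value = value * z + c
    return value

def max_modulus(n, m, samples=2000, seed=0):
    """Largest sampled |R[n,m](z)| over random z with -20 <= Re z <= -0.01."""
    P, Q = pade_exp(n, m)
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(samples):
        z = complex(-rng.uniform(0.01, 20.0), rng.uniform(-20.0, 20.0))
        worst = max(worst, abs(horner(P, z) / horner(Q, z)))
    return worst

for n in range(1, 4):
    print(n, max_modulus(n, n), max_modulus(n + 1, n))
print("R[0,1](z) = 1 + z:", max_modulus(0, 1))
```

The sampled maxima for $R[n, n]$ and $R[n+1, n]$ stay below 1, whereas the polynomial $R[0, 1]$ is unbounded in the left half-plane.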


5.9.3 Examples of A-Stable One-Step Methods 


1. Implicit Euler method: Also called the backward Euler method, this is the one-step
method defined by

$$u_{n+1} = u_n + h f(x_{n+1}, u_{n+1}). \tag{5.157}$$

It requires, at each step, the solution of a system of (in general) nonlinear
equations for $u_{n+1} \in \mathbb{R}^d$. In the case of the model problem (5.137), this becomes
$u_{n+1} = u_n + hA u_{n+1}$, and can be solved explicitly: $u_{n+1} = (I - hA)^{-1} u_n$. Thus,
the associated function $\varphi$ here is

$$\varphi(z) = \frac{1}{1 - z} = 1 + z + z^2 + \cdots, \tag{5.158}$$

the Padé approximant $R[1, 0](z)$ of $e^z$. Since $\varphi(z) - e^z = O(z^2)$ as $z \to 0$, the
method has order $p = 1$ (cf. (5.142)), and by Theorem 5.9.2 is A-stable. (This
could easily be confirmed directly.)

2. Trapezoidal rule: Here

$$u_{n+1} = u_n + \tfrac{1}{2} h\,[f(x_n, u_n) + f(x_{n+1}, u_{n+1})], \tag{5.159}$$

again a nonlinear equation in $u_{n+1}$. For the model problem (5.137),
this becomes $u_{n+1} = (I + \tfrac{1}{2} hA)\,u_n + \tfrac{1}{2} hA\,u_{n+1}$, hence
$u_{n+1} = (I - \tfrac{1}{2} hA)^{-1} (I + \tfrac{1}{2} hA)\,u_n$, and

$$\varphi(z) = \frac{1 + \tfrac{1}{2} z}{1 - \tfrac{1}{2} z} = 1 + z + \tfrac{1}{2} z^2 + \tfrac{1}{4} z^3 + \cdots. \tag{5.160}$$

This is the Padé approximant $\varphi(z) = R[1, 1](z)$ of $e^z$ and $\varphi(z) - e^z = O(z^3)$, so
that again the method is A-stable, but now of order $p = 2$.

3. Implicit Runge-Kutta formulae: As mentioned in Sect. 5.6.5, an $r$-stage implicit
Runge-Kutta formula has the form (cf. (5.68))

$$\Phi(x, y; h) = \sum_{s=1}^{r} \alpha_s k_s(x, y; h),$$

$$k_s = f\Big(x + \mu_s h,\ y + h \sum_{j=1}^{r} \lambda_{sj} k_j\Big), \quad s = 1, 2, \ldots, r. \tag{5.161}$$

It is possible to show (cf. Notes to Sect. 5.6.5) that (5.161) is a method of order
$p$, $r \le p \le 2r$, if $f \in C^p$ on $[a, b] \times \mathbb{R}^d$ and

$$\sum_{j=1}^{r} \lambda_{sj} \mu_j^k = \frac{\mu_s^{k+1}}{k+1}, \quad k = 0, 1, \ldots, r-1; \quad s = 1, 2, \ldots, r, \tag{5.162}$$

$$\sum_{s=1}^{r} \alpha_s \mu_s^k = \frac{1}{k+1}, \quad k = 0, 1, \ldots, p-1. \tag{5.163}$$

For any set of distinct $\mu_j$, and for each $s = 1, 2, \ldots, r$, the equations (5.162)
represent a system of linear equations for $\{\lambda_{sj}\}_{j=1}^{r}$ whose coefficient matrix is
a Vandermonde matrix, hence nonsingular. It thus can be solved uniquely for the
$\{\lambda_{sj}\}$. Both conditions (5.162) and (5.163) can be viewed more naturally in terms
of quadrature formulae. Indeed, (5.162) is equivalent to

$$\int_0^{\mu_s} p(t)\,dt = \sum_{j=1}^{r} \lambda_{sj}\, p(\mu_j), \quad \text{all } p \in \mathbb{P}_{r-1}, \tag{5.164}$$

whereas (5.163) means

$$\int_0^1 q(t)\,dt = \sum_{s=1}^{r} \alpha_s\, q(\mu_s), \quad \text{all } q \in \mathbb{P}_{p-1}. \tag{5.165}$$

We know from Chap. 3, Sect. 3.2.2, that in (5.165) we can indeed have $r \le p \le 2r$,
the extreme values corresponding to Newton–Cotes formulae (with
prescribed $\mu_s$) and to the Gauss–Legendre formula on $[0, 1]$, where the $\mu_s$ are
the zeros of the (shifted) Legendre polynomial of degree $r$. In the latter case, we
obtain a unique $r$-stage Runge-Kutta formula of order $p = 2r$. We now show
that this Runge-Kutta method of maximum order $2r$ is also A-stable.

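Instead of the system (5.137), we may as well consider a scalar equation

$$\frac{dy}{dx} = \lambda y, \tag{5.166}$$

to which (5.137) can be reduced by spectral decomposition (i.e., $\lambda$ represents
one of the eigenvalues of $A$). Applied to (5.166), the $k_s$ corresponding to (5.161)
must satisfy the linear system

$$k_s = \lambda \Big( y + h \sum_{j=1}^{r} \lambda_{sj} k_j \Big), \quad s = 1, 2, \ldots, r,$$

that is (with $z = \lambda h$),

$$k_s - z \sum_{j=1}^{r} \lambda_{sj} k_j = \lambda y, \quad s = 1, 2, \ldots, r.$$

Let

$$d_r(z) = \begin{vmatrix}
1 - z\lambda_{11} & -z\lambda_{12} & \cdots & -z\lambda_{1r} \\
-z\lambda_{21} & 1 - z\lambda_{22} & \cdots & -z\lambda_{2r} \\
\vdots & \vdots & & \vdots \\
-z\lambda_{r1} & -z\lambda_{r2} & \cdots & 1 - z\lambda_{rr}
\end{vmatrix},$$

$$d_{r,s}(z) = \begin{vmatrix}
1 - z\lambda_{11} & \cdots & 1 & \cdots & -z\lambda_{1r} \\
-z\lambda_{21} & \cdots & 1 & \cdots & -z\lambda_{2r} \\
\vdots & & \vdots & & \vdots \\
-z\lambda_{r1} & \cdots & 1 & \cdots & 1 - z\lambda_{rr}
\end{vmatrix}, \quad s = 1, 2, \ldots, r,$$

where the column of ones is the $s$th column of the determinant. Clearly, $d_r$ and
$d_{r,s}$ are polynomials of degree $r$ and $r - 1$, respectively. By Cramer’s rule,

$$k_s = \frac{d_{r,s}(z)}{d_r(z)}\, \lambda y, \quad s = 1, 2, \ldots, r,$$

so that

$$y_{\text{next}} = y + h \sum_{s=1}^{r} \alpha_s k_s = \frac{d_r(z) + z \sum_{s=1}^{r} \alpha_s d_{r,s}(z)}{d_r(z)}\, y.$$

Thus, the function $\varphi$ associated with the method $\Phi$ corresponding to (5.161) is

$$\varphi(z) = \frac{d_r(z) + z \sum_{s=1}^{r} \alpha_s d_{r,s}(z)}{d_r(z)}. \tag{5.167}$$

We see that $\varphi$ is a rational function of type $[r, r]$ and, the method $\Phi$ having order
$p = 2r$, we have (cf. (5.142))

$$e^z = \varphi(z) + O(z^{2r+1}), \quad z \to 0.$$

It follows that $\varphi$ in (5.167) is the Padé approximant $R[r, r]$ to the exponential
function, and hence $\Phi$ is A-stable by Theorem 5.9.2.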
4. Ehle’s method: This is a method involving total derivatives of $f$ (cf. (5.45)),

$$\Phi(x, y; h) = k(x, y; h),$$

$$k = \sum_{s=1}^{r} h^{s-1} \left[ \alpha_s f^{[s-1]}(x, y) - \beta_s f^{[s-1]}(x + h, y + hk) \right]. \tag{5.168}$$

It is also implicit, as it requires the solution of the second equation in (5.168) for
the vector $k \in \mathbb{R}^d$. A little computation (see Ex. 21) will show that the function
$\varphi$ associated with $\Phi$ in (5.168) is given by

$$\varphi(z) = \frac{1 + \sum_{s=1}^{r} \alpha_s z^s}{1 + \sum_{s=1}^{r} \beta_s z^s}. \tag{5.169}$$

By choosing this $\varphi$ to be a Padé approximant to $e^z$, either $R[r, r]$ or $R[r, r-1]$
(by letting $\alpha_r = 0$), we again obtain two A-stable methods. The latter has the
additional property of being strongly A-stable (or L-stable), in the sense that

$$\varphi(z) \to 0 \quad \text{as } \mathrm{Re}\, z \to -\infty. \tag{5.170}$$

This means, in view of (5.143), that convergence $u_n \to 0$ as $n \to \infty$ is faster
for components corresponding to eigenvalues further to the left in the complex
plane.
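The difference between A-stability and L-stability in the sense of (5.170) shows up already in the two simplest examples above. The following Python sketch compares the stability function (5.160) of the trapezoidal rule with the stability function (5.158) of the implicit Euler method for real $z$ far to the left; the sample values of $z$ and the number of steps are arbitrary choices for illustration.

```python
def phi_trapezoidal(z):
    return (1 + z / 2) / (1 - z / 2)    # R[1,1], cf. (5.160)

def phi_implicit_euler(z):
    return 1 / (1 - z)                  # R[1,0], cf. (5.158)

for z in (-1e1, -1e3, -1e6):
    print(z, abs(phi_trapezoidal(z)), abs(phi_implicit_euler(z)))

# Damping of a fast component after n = 10 steps with h * lambda = -1000,
# cf. (5.143): the factor is |phi(h * lambda)| ** n.
n = 10
print(abs(phi_trapezoidal(-1000.0)) ** n, abs(phi_implicit_euler(-1000.0)) ** n)
```

For the trapezoidal rule $|\varphi(z)|$ tends to 1 as $z \to -\infty$, so very fast components are damped only slowly, while for the implicit Euler method $|\varphi(z)|$ tends to 0.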


5.9.4 Regions of Absolute Stability 


For methods $\Phi$ that are not A-stable, it is important to know the region of absolute
stability,

$$\mathcal{D}_A = \{z \in \mathbb{C} : |\varphi(z)| < 1\}. \tag{5.171}$$

If the method $\Phi$ applied to the model problem (5.137) is to produce an approximate
solution $u = \{u_n\}$ with $\lim_{n \to \infty} u_n = 0$, it is necessary that $h\lambda_i(A) \in \mathcal{D}_A$ for all
eigenvalues $\lambda_i(A)$ of $A$. If some of these have very large negative real parts, then
this condition imposes a severe restriction on the step length $h$, unless $\mathcal{D}_A$ contains a
large portion of the left half-plane. For many classical methods, unfortunately, this
is not the case. For Euler’s method, for example, we have $\varphi(z) = 1 + z$, hence

$$\mathcal{D}_A = \{z \in \mathbb{C} : |1 + z| < 1\} \quad \text{(Euler)}, \tag{5.172}$$

and the region of absolute stability is the unit disk in $\mathbb{C}$ centered at $-1$. More
generally, for the Taylor expansion method of order $p \ge 1$, and also for any $p$-stage
explicit Runge-Kutta method of order $p$, $1 \le p \le 4$, one has (see Ex. 18)

$$\varphi(z) = 1 + \frac{1}{1!}\, z + \frac{1}{2!}\, z^2 + \cdots + \frac{1}{p!}\, z^p. \tag{5.173}$$

Fig. 5.4. Regions of absolute stability for $p$th-order methods with $\varphi$ as in (5.173),
$p = 1, 2, \ldots, 21$

To compute the contour line $|\varphi(z)| = 1$, which delineates the region $\mathcal{D}_A$, one can
find a differential equation for this line and use a one-step method to solve it (see
MA 4). That is, we can use a method $\Phi$ to analyze its own stability region, a case
of self-analysis, as it were. The results for $\varphi$ in (5.173) and $p = 1, 2, \ldots, 21$ are
plotted in Fig. 5.4.¹⁰ Because of symmetry, only the parts of the regions in the upper
half-plane are shown.
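A much cruder computation than the contour tracing described above already conveys how restrictive these regions are: one can locate the point where the region $\mathcal{D}_A$ for $\varphi$ in (5.173) leaves the negative real axis. The Python sketch below marches left from the origin until $|\varphi(x)| \ge 1$ and then refines the crossing by bisection. This is not the differential-equation approach of MA 4; the function names and the marching step are assumptions of this example.

```python
from math import factorial

def phi(z, p):
    """Stability function (5.173): Taylor polynomial of e^z of degree p."""
    return sum(z ** k / factorial(k) for k in range(p + 1))

def left_endpoint(p, step=0.01, tol=1e-12):
    """Left end x* of the real interval (x*, 0) adjacent to the origin
    on which |phi(x)| < 1."""
    x = -step
    while abs(phi(x - step, p)) < 1:     # march left while still inside
        x -= step
    lo, hi = x - step, x                 # |phi(lo)| >= 1 > |phi(hi)|
    while hi - lo > tol:                 # bisection on the crossing
        mid = 0.5 * (lo + hi)
        if abs(phi(mid, p)) < 1:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for p in range(1, 5):
    print(p, left_endpoint(p))
```

For $p = 1$ this recovers the left end $-2$ of the disk (5.172). For an eigenvalue $\lambda < 0$ the step length must then satisfy $h < |x^*|/|\lambda|$, which is the severe restriction mentioned above.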


5.10 Notes to Chapter 5 


The classic text on the numerical solution of nonstiff ordinary differential equations
is Henrici [1962]. It owes much to the pioneering work of Dahlquist [1956] a
few years earlier. Numerous texts have since been written with varying areas of
emphasis, but all paying attention to stiff problems. We mention only a few of
the more recent ones: the balanced exposition in Lambert [1991], the two-volume
work of Hairer et al. [1993] and Hairer and Wanner [2010], which is especially
rich in historical references, interesting examples, and numerical experimentation,
Shampine [1994] emphasizing software issues, Iserles [2009] including also topics
in numerical partial differential equations, Ascher and Petzold [1998] treating also
differential algebraic equations, Bellen and Zennaro [2003] focusing on delay
differential equations, Shampine et al. [2003] using Matlab as a problem solving
environment, Butcher [2008], and Atkinson et al. [2009], which also provides access
to Matlab codes.


¹⁰ Figure 5.4 is reproduced, with permission, from W. Gautschi and J. Waldvogel, Contour plots of
analytic functions, in Solving problems in scientific computing using Maple and MATLAB, Walter
Gander and Jiří Hřebíček, eds., 4th ed., Springer, Berlin, 2004.

An authoritative work on Runge—Kutta methods is Butcher [1987]. Error and sta- 
bility analyses of Runge-Kutta methods in the context of stiff nonlinear differential 
equations are the subject of the monograph by Dekker and Verwer [1984]. 


Section 5.1. The example (5.3) is from Klopfenstein [1965]. The technique de- 
scribed therein is widely used in codes calculating trajectories of space vehicles. 
For a detailed presentation of the method of lines, including Fortran programs, see 
Schiesser [1991]. 


Section 5.2. There are still other important types of differential equations, for 
example, singularly perturbed equations and related differential algebraic equations 
(DAEs), and differential equations with delayed arguments. For the former, we refer 
to Griepentrog and März [1986], Brenan et al. [1996], Hairer et al. [1989], and
Hairer and Wanner [2010, Chaps. 6—7], for the latter to Pinney [1958], Bellman and 
Cooke [1963], Cryer [1972], Driver [1977], and Kuang [1993]. 


Section 5.3. For a proof of the existence and uniqueness part of Theorem 5.3.1, see 
Henrici [1962, Sect. 1.2]. The proof is given for a scalar initial value problem, but 
it extends readily to systems. Continuity with respect to initial data is proved, for 
example, in Coddington and Levinson [1955, Chap. 1, Sect. 7]. Also see Butcher 
[1987, Sect. 112]. 

A strengthened version of Theorem 5.3.1 involves a one-sided Lipschitz
condition,

$$[f(x, y) - f(x, y^*)]^T (y - y^*) \le \lambda \|y - y^*\|^2, \quad \text{all } x \ge a, \ \text{all } y, y^* \in \mathbb{R}^d,$$

where $\lambda$ is some constant, the one-sided Lipschitz constant. If this holds, and $f$ is
continuous in $x$, then the initial value problem (5.16) has a unique solution on any
interval $[a, b]$, $b > a$ (cf. Butcher [1987, Sect. 112]).


Section 5.4. The numerical solution of differential equations can sometimes benefit 
from a preliminary transformation of variables. Many examples are given in Daniel 
and Moore [1970, Part 3]; an important example used in celestial mechanics to 
regularize and linearize the Newtonian equations of motion are the transformations 
of Levi—Civita, and of Kustaanheimo and Stiefel, for which we refer to Stiefel and 
Scheifele [1971] for an extensive treatment. The reader may wish to also consult 
Zwillinger [1992b] for a large number of other analytical tools that may be helpful 
in the numerical solution of differential equations.

The distinction between one-step and multistep methods may be artificial, as
there are theories that allow treating them both in a unified manner; see, for example,
Stetter [1973, Chap. 5], Butcher [1987, Chap. 4], or Hairer et al. [1993, Chap. 3,
Sect. 8]. We chose, however, to cover them separately for didactical reasons.

There are many numerical methods in use that are not discussed in our text. 
Among the more important ones are the extrapolation methods of Gragg and of 
Gragg, Bulirsch, and Stoer, which extend the ideas in Chap. 3, Sect. 3.2.7, to
differential equations. For these, we refer to Stetter [1973, Sect. 6.3] and Hairer
et al. [1993, Chap. 2, Sects. 8, 9]. (Extrapolation methods for stiff problems are
discussed in Hairer and Wanner [2010, Chap. 4, Sect. 9].) Both these texts also
contain accounts of multistep methods involving derivatives, and of Nordsieck- 
type methods, which carry along not only function values but also derivative values 
from one step to the next. There are also methods tailored to higher-order systems 
of differential equations; for second-order systems, for example, see Hairer et al. 
[1993, Chap. 2, Sect. 14]. Very recently, so-called symplectic methods have created 
a great deal of interest, especially in connection with Hamiltonian systems. These 
are numerical methods that preserve invariants of the given differential system; cf. 
Sanz-Serna and Calvo [1994], Hairer et al. [2006]. 


Section 5.5. A nonstiff initial value problem (5.16) on $[a, b]$, where $b$ is very large,
is likely one that also has a very small Lipschitz constant. The problem, in this case,
is not properly scaled, and one should transform the independent variable $x$, for
example by letting $x = (1 - t)a + tb$, to get an initial value problem on $[0, 1]$,
namely, $dz/dt = g(t, z)$, $0 \le t \le 1$, $z(0) = y_0$, where $z(t) = y((1 - t)a + tb)$ and
$g(t, z) := (b - a)\, f((1 - t)a + tb, z)$. If $f$ has a (very small) Lipschitz constant $L$,
then $g$ has the Lipschitz constant $(b - a)L$, which may well be of more reasonable
size.


Section 5.6.1. In the spirit of Laplace’s exhortation “Lisez Euler, lisez Euler, c’est 
notre maître à tous!,” the reader is encouraged to look at Euler’s original account of
his method in Euler [1768, Sect. 650]. Even though it is written in Latin, Euler’s use 
of this language is plain and simple. 


Section 5.6.2. The method of Taylor expansion was also proposed by Euler [op.cit., 
Sect. 656]. It has long been perceived as being too cumbersome in practice, but 
recent advances in automatic differentiation helped to revive interest in this method. 
Codes have been written that carry out the necessary differentiations systematically 
by recursion (Gibbons [1960] and Barton et al. [1971]). Combining these techniques 
with interval arithmetic, as in Moore [1979, Sect. 3.4], also provides rigorous error 
bounds. 


Section 5.6.5. For the results in (5.70), see Butcher [1965]. Actually, p*(10) = 7, 
as was shown more recently in Butcher [1985]. The highest orders of an explicit 
Runge-Kutta method ever constructed are p = 12 with 17 stages, and p = 14 with 
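35 stages (Feagin [2011]).

It has become customary to associate with the general $r$-stage Runge–Kutta
method (5.68) the array

$$\begin{array}{c|cccc}
\mu_1 & \lambda_{11} & \lambda_{12} & \cdots & \lambda_{1r} \\
\mu_2 & \lambda_{21} & \lambda_{22} & \cdots & \lambda_{2r} \\
\vdots & \vdots & \vdots & & \vdots \\
\mu_r & \lambda_{r1} & \lambda_{r2} & \cdots & \lambda_{rr} \\ \hline
 & \alpha_1 & \alpha_2 & \cdots & \alpha_r
\end{array}
\qquad \left( \text{in matrix form: } \begin{array}{c|c} \mu & \Lambda \\ \hline & \alpha^T \end{array} \right),$$

called the Butcher array. For an explicit method, $\mu_1 = 0$ and $\Lambda$ is lower triangular
with zeros on the diagonal. With the first $r$ rows of the Butcher array, we may
associate the quadrature rules $\int_0^{\mu_s} u(t)\,dt \approx \sum_{j=1}^{r} \lambda_{sj} u(\mu_j)$, $s = 1, 2, \ldots, r$, and
with the last row the rule $\int_0^1 u(t)\,dt \approx \sum_{s=1}^{r} \alpha_s u(\mu_s)$. If the respective degrees of
exactness are $d_s = q_s - 1$, $1 \le s \le r + 1$ ($d_s = \infty$ if $\mu_s = 0$ and all $\lambda_{sj} = 0$), then
by the Peano representation of error functionals (cf. Chap. 3, (3.79)) the remainder
terms involve derivatives of $u$ of order $q_s$, and hence, setting $u(t) = y'(x + th)$,
one gets

$$\frac{y(x + \mu_s h) - y(x)}{h} - \sum_{j=1}^{r} \lambda_{sj}\, y'(x + \mu_j h) = O(h^{q_s}), \quad s = 1, 2, \ldots, r,$$

and

$$\frac{y(x + h) - y(x)}{h} - \sum_{s=1}^{r} \alpha_s\, y'(x + \mu_s h) = O(h^{q_{r+1}}).$$

The quantity $q = \min(q_1, q_2, \ldots, q_r)$ is called the stage order of the Runge-Kutta
formula, and $q_{r+1}$ the quadrature order.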

High-order $r$-stage implicit Runge–Kutta methods have the property that, when
$f(x, y) = f(x)$, they reduce to $r$-point Gauss-type quadrature formulae, either
the Gauss formula proper, or the Gauss–Radau ($\mu_1 = 0$ or $\mu_r = 1$) or Gauss–Lobatto
($\mu_1 = 0$ and $\mu_r = 1$) formula; see, for example, Dekker and Verwer [1984,
Sect. 3.3], Butcher [1987, Sect. 34], Lambert [1991, Sect. 5.11], and Hairer and
Wanner [2010, Chap. 4, Sect. 5]. They can be constructed to have order $2r$, and (in a
variety of ways) orders $2r - 1$ and $2r - 2$, respectively; cf. also Sect. 5.9.3(3). Another
interesting way of constructing implicit Runge-Kutta methods is by collocation:
define $p \in \mathbb{P}_r$ to be such that $p(x) = y$, $p'(x + \mu_s h) = f(x + \mu_s h, p(x + \mu_s h))$,
$s = 1, 2, \ldots, r$ (cf. Chap. 2, Ex. 66), and let $y_{\text{next}} = p(x + h)$. It has been shown
by Wright [1970] (also cf. Butcher [1987, Sect. 346]) that this indeed is an implicit
$r$-stage method, a collocation method, as it is called. For such methods, the
stage orders are at least $r$. This property characterizes collocation methods of
orders $\ge r$ (with distinct $\mu_s$); see Hairer et al. [1993, Theorem 7.8, p. 212]. The
order of the method is $p = r + k$, $k \ge 0$, if the quadrature order is $p$ (cf.
[loc.cit., Chap. 2, Theorem 7.9] and Chap. 3, Theorem 3.2.1). Some (but not all)
of the Gauss-type methods previously mentioned are collocation methods. If the
polynomial $p(x + th)$ is determined explicitly (not just $p(x + h)$), it provides a
means of computing intermediate approximations for arbitrary $t$ with $0 < t < 1$,
giving rise to a “continuous” implicit Runge-Kutta method.

Semi-implicit Runge-Kutta methods with all diagonal elements of $\Lambda$ in the
Butcher array being the same nonzero real number are called DIRK methods
(Diagonally Implicit Runge-Kutta); see Nørsett [1974], Crouzeix [1976], and
Alexander [1977]. SIRK methods (Singly-Implicit Runge-Kutta) are fully implicit
methods which share with DIRK methods the property that the matrix $\Lambda$ (though not
triangular) has one single real eigenvalue of multiplicity $r$. These were derived by
Nørsett [1976] and Burrage [1978a], [1978b], [1982]. DIRK methods with $r$ stages
have maximum order $r + 1$, but these are difficult to derive for large $r$, in contrast
to SIRK methods (see Dekker and Verwer [1984, Sects. 3.5 and 3.6]).

The best source for Butcher’s theory of Runge-Kutta methods and their at- 
tainable orders is Butcher [1987, Chap. 3, Sects. 30-34]. A simplified version of 
this theory can be found in Lambert [1991, Chap. 5] and an alternative approach
in Albrecht [1987, 1996]. It may be worth noting that the order conditions for a 
system of differential equations are not necessarily identical with those for a single 
differential equation. Indeed, a Runge-Kutta method for a scalar equation may 
have order p > 4, whereas the same method applied to a system has order < p. 
(For p < 4 this phenomenon does not occur.) Examples of explicit Runge-Kutta 
formulae of orders 5—8 are given in Butcher [1987, Sect. 33]. 

An informative cross-section of contemporary work on the Runge—Kutta method, 
as well as historical essays, celebrating the centenary of Runge’s 1895 paper, can be 
found in Butcher [1996]. 


Section 5.7.1. The concept of stability as defined in this section is from Keller 
[1992, Sect. 1.3]. It is also known as zero-stability (relating to $h \to 0$) to distinguish
it from other stability concepts used in the context of stiff differential equations; for 
the latter, see the Notes to Sect. 5.9.1. 


Section 5.7.2. Theorem 5.7.2 admits a converse if one assumes ® continuous and 
satisfying a Lipschitz condition (5.86); that is, consistency is then also necessary for 
convergence (cf. Henrici [1962, Theorem 3.2]). 


Section 5.7.3. Theorem 5.7.3 is due independently to Henrici [1962, Theorem 3.4] 
and Tihonov and Gorbunov [1963, 1964]. Henrici deals also with variable steps in 
the form alluded to at the beginning of this section, whereas Tihonov and Gorbunov 
[1964] deal with arbitrary nonuniform grids. 


Section 5.8.1. Although the idea of getting global error estimates by integrating 
the variational equation along with the main differential equation has already been 
expressed by Henrici [1962, p. 81], its precise implementation as in Theorem 5.8.1
is carried out in Gautschi [1975b]. 


Section 5.8.2(2). Fehlberg’s embedded 4(5) method with $r^* = 6$ (cf. Fehlberg
[1969, 1970]) appears to be a popular method. (It is one of two options provided in
Matlab, the other being a 2(3) pair.) A similar method due to England [1969/1970]
has the advantage of possessing coefficients $\alpha_5 = \alpha_6 = 0$, which makes occasional
error monitoring more efficient. All of Fehlberg’s methods of orders $p > 5$ have the
(somewhat disturbing) peculiarity of yielding zero error estimates in cases where
$f$ does not depend on $y$. High-order pairs of methods not suffering from this defect
have been derived in Verner [1978]. Variable-method codes developed, for example,
in Shampine and Wisniewski [1978], use pairs ranging from 3(4) to 7(8). Instead of
optimizing the truncation error in the lower-order method of a pair, as was done
by Fehlberg, one can do the same with the higher-order method and use the other
only for step control. This is the approach taken by Dormand and Prince [1980] and
Prince and Dormand [1981]. Their 4(5) and 7(8) pairs appear to be among current
state-of-the-art choices (cf. Hairer et al. [1993, Chap. 2, Sect. 10] and the appendix
of this reference for codes).


Section 5.8.3. For a proof of Theorem 5.8.2, see, for example, Butcher [1987, 
Theorem 112J]. 


Section 5.9. The major text on stiff problems is Hairer and Wanner [2010]. For 
Runge-Kutta methods, also see Dekker and Verwer [1984]. 


Section 5.9.1. The function $\varphi(z)$ for a general Runge-Kutta method can be 
expressed in terms of the associated Butcher array as 
$\varphi(z) = 1 + z\alpha^T (I - zA)^{-1} e$, where $e^T = [1, 1, \ldots, 1]$ or, alternatively, as 
$\det(I - zA + z e \alpha^T)/\det(I - zA)$; see, for example, Dekker and Verwer [1984, Sect. 3.4], 
Lambert [1991, Sect. 5.12], and Hairer and Wanner [2010, Chap. 4, Sect. 3]. Thus, $\varphi$ 
is a rational function if the method is (semi-) implicit, and a polynomial otherwise. 
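
As a small numerical illustration of the two representations just quoted, the following Matlab sketch evaluates both for the classical fourth-order Runge-Kutta method and compares them with the polynomial 1 + z + z^2/2 + z^3/6 + z^4/24 expected for a four-stage explicit method of order 4 (cf. Ex. 18). The Butcher array is typed in by hand; the variable names and the test point z are illustrative choices, not taken from the text.

```matlab
% Sketch: stability function from the Butcher array (classical RK4).
% All names (A, alpha, phi, phidet, taylor) are illustrative.
A = [0   0   0 0;
     1/2 0   0 0;
     0   1/2 0 0;
     0   0   1 0];              % strictly lower triangular: explicit method
alpha = [1/6; 1/3; 1/3; 1/6];   % weights
e = ones(4,1);
I = eye(4);

% resolvent form: 1 + z*alpha'*(I - z*A)^(-1)*e
phi    = @(z) 1 + z*(alpha'*((I - z*A)\e));
% determinant form: det(I - z*A + z*e*alpha')/det(I - z*A)
phidet = @(z) det(I - z*A + z*(e*alpha'))/det(I - z*A);
% polynomial expected for an explicit 4-stage method of order 4
taylor = @(z) 1 + z + z^2/2 + z^3/6 + z^4/24;

z  = -1.3 + 0.7i;               % arbitrary test point in the left half-plane
d1 = abs(phi(z) - taylor(z));
d2 = abs(phidet(z) - taylor(z));
fprintf('resolvent form:   |phi - taylor| = %.2e\n', d1);
fprintf('determinant form: |phi - taylor| = %.2e\n', d2);
```

Both differences should be at rounding level. For an implicit method the same two formulas apply, but the result is a genuinely rational function of z.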

The concept of A-stability, which was introduced by Dahlquist [1963], can be 
relaxed by requiring $|\varphi(z)| < 1$ to hold only in an unbounded subregion S of the 
left half-plane. Widlund [1967], in the context of multistep methods (cf. Chap. 6, 
Sect. 6.5.2), for example, takes for S an angular region $|\arg(-z)| < \alpha$, where 
$\alpha < \frac{1}{2}\pi$, and speaks of A($\alpha$)-stability, whereas Gear [1971a, Sect. 11.1] takes the 
union of some half-plane $\mathrm{Re}\, z < \rho < 0$ and a rectangle $\rho \le \mathrm{Re}\, z \le 0$, $|\mathrm{Im}\, z| \le \sigma$, 
and speaks of stiff stability. Other stability concepts relate to more general test 
problems, some linear and some nonlinear. Of particular interest among the latter 
are initial value problems (5.16) with f satisfying a one-sided Lipschitz condition 
with constant $\lambda = 0$. These systems are dissipative in the sense that $\|y(x) - z(x)\|$ 
is nonincreasing for x > a for any two solutions y, z of the differential equation. 
Requiring the same to hold for any two numerical solutions u, v generated by the 
one-step method, that is, requiring that $\|u_{n+1} - v_{n+1}\| \le \|u_n - v_n\|$ for all $n \ge 0$, 
gives rise to the concept of B-stability (or BN-stability with the "N" standing for 
"nonautonomous"). The implicit r-stage Runge-Kutta method of order p = 2r (cf. 
Sect. 5.9.3(3)), for example, is B-stable, and so are some of the other Gauss-type 
Runge-Kutta methods; see Dekker and Verwer [1984, Sect. 4.1], Butcher [1987, 
Sect. 356], and Hairer and Wanner [2010, Chap. 4, Sects. 12 and 13]. Another 
family of Runge-Kutta methods that are B-stable are the so-called algebraically 
stable methods, that is, methods which satisfy $D = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_r) \ge 0$ and 
$DA + A^T D - \alpha\alpha^T$ nonnegative definite (Dekker and Verwer [1984, Sect. 4.2], 
Butcher [1987, Sect. 356], and Hairer and Wanner [2010, Chap. 4, Sects. 12 
and 13]). Similar stability concepts can also be associated with test equations 
satisfying one-sided Lipschitz conditions with constants $\lambda \ne 0$ (Dekker and 
Verwer [1984, Sect. 5.11], Butcher [1987, Sect. 357], and Hairer and Wanner 
[2010, Chap. 4, pp. 193ff]). 
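
The algebraic stability condition can be checked mechanically once a Butcher array is given. The Matlab sketch below does so for the two-stage Gauss method (order 4), whose coefficients are entered here as an assumption (they are the standard ones, not quoted in these Notes), and contrasts it with the explicit classical Runge-Kutta method. All variable names are illustrative.

```matlab
% Sketch: test D >= 0 and M = D*A + A'*D - alpha*alpha' nonnegative definite.
% Two-stage Gauss method (coefficients assumed, standard values).
s3 = sqrt(3);
A = [1/4,        1/4 - s3/6;
     1/4 + s3/6, 1/4];
alpha = [1/2; 1/2];
D = diag(alpha);
M = D*A + A'*D - alpha*alpha';           % symmetric by construction
minD        = min(alpha);                 % D >= 0 requires minD >= 0
minEigGauss = min(eig((M + M')/2));       % should be zero up to rounding

% Classical fourth-order Runge-Kutta method, for contrast.
A4 = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0];
a4 = [1/6; 1/3; 1/3; 1/6];
M4 = diag(a4)*A4 + A4'*diag(a4) - a4*a4';
minEigRK4 = min(eig((M4 + M4')/2));       % negative: condition violated

fprintf('Gauss(2): min(alpha) = %.3f, norm(M) = %.2e\n', minD, norm(M));
fprintf('RK4:      smallest eigenvalue of M = %.4f\n', minEigRK4);
```

For the Gauss method the matrix M vanishes identically, so the condition holds with equality; for the explicit method the (1,1) entry of M is already negative.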


Section 5.9.2. Standard texts on Padé approximation are Baker [1975] and Baker 
and Graves-Morris [1996]. The proof of Theorem 5.9.1 follows Perron [1957, Sect. 
42], who in turn took it from Padé. For a derivation of the Routh—Hurwitz criterion 
mentioned in (2), see, for example, Marden [1966, Corollary 40,2], and for Rouché’s 
theorem, Henrici [1988, p.280]. The elegant argument used in the proof of Property 
(4) is due to Axelsson [1969]. 


Section 5.9.3. For the study of A-stability, it is easier to work with the "relative 
stability function" $\varphi(z)e^{-z}$ than with $\varphi(z)$ directly. This gives rise to the "order star" 
theory of Wanner et al. [1978], which, among other things, made it possible to prove 
that the only Padé approximants to $e^z$ that yield A-stable methods are 
$\varphi(z) = R[n+k, n](z)$, n = 0, 1, 2, ..., with $0 \le k \le 2$. All Gauss-type Runge-Kutta methods 
in current use have stability functions given by such Padé approximants and are 
thus A-stable (Dekker and Verwer [1984, Sect. 3.4]). Also, see Iserles and Nørsett 
[1991], and Hairer and Wanner [2010, Chap. 4, Sect. 4] for further applications of 
order stars. 

In addition to the implicit Gauss-type Runge-Kutta methods mentioned in (3), 
there are also A-stable DIRK and SIRK methods; for their construction, see Butcher 
[1987, Sect. 353] and Hairer and Wanner [2010, Chap. 4, Sect. 6]. Another class 
of methods that are A-stable, or nearly so, are basically explicit Runge-Kutta 
methods that make use of the Jacobian matrix and inverse matrices involving it. 
They are collectively called Runge-Kutta-Rosenbrock methods; see, for example, 
Dekker and Verwer [1984, Chap. 9] and Hairer and Wanner [2010, Chap. 4, Sect. 7]. 
A criterion for A-stability of methods belonging to the function $\varphi$ in (5.169) can be 
found in Crouzeix and Ruamps [1977]. 

In all the results previously described, it was tacitly assumed that the nonlinear 
systems of equations to be solved in an implicit Runge-Kutta method have a unique 
solution. This is not necessarily the case, not even for linear differential equations 
with constant coefficient matrix. Such questions of existence and uniqueness are 
considered in Dekker and Verwer [1984, Chap. 5], where one also finds a discussion 
of, and references to, the efficient implementation of implicit Runge—Kutta methods. 
Also see Hairer and Wanner [2010, Chap. 4, Sects. 14, 8]. 

The theory of consistency and convergence developed in Sect. 5.7 for nonstiff 
problems must be modified when dealing with stiff systems of nonlinear differential 
equations. One-sided Lipschitz conditions are then the natural vehicles, and 
B-consistency and B-convergence the relevant concepts; see Dekker and Verwer 
[1984, Chap. 7] and Hairer and Wanner [2010, Chap. 4, Sect. 15]. 


Section 5.9.4. Regions of absolute stability for the embedded 4(5) and 7(8) 
Runge-Kutta pairs of Dormand and Prince (cf. Notes to Sect. 5.8.2(2)), and 
for the Gragg, Bulirsch, and Stoer extrapolation method (cf. Notes to Sect. 5.4), 
are displayed in Hairer and Wanner [2010, Chap. 4, Sect. 2]. Attempts to construct 
explicit Runge-Kutta formulae whose regions of absolute stability, along the 
negative real axis, extend to the left as far as possible lead to interesting applications 
of Chebyshev polynomials; see Hairer and Wanner [2010, pp. 31-36]. 


Exercises and Machine Assignments to Chapter 5 


Exercises 


1. Consider the initial value problem 


\[ \frac{dy}{dx} = k\,(y + y^3), \quad 0 \le x \le 1; \qquad y(0) = s, \]


where k > 0 (in fact, k >> 1) and s > 0. Under what conditions on s does 
the solution y(x) = y(x; s) exist on the whole interval [0, 1]? {Hint: find y 
explicitly.} 
2. Prove (5.46). 
3. Prove 
\[ (f_y f)_y f = f^T f_{yy} f + f_y^2 f. \]
4. Let 
\[ f(x,y) = \begin{bmatrix} f^1(x,y) \\ f^2(x,y) \\ \vdots \\ f^d(x,y) \end{bmatrix} \]
be a $C^1$ map from $[a,b] \times \mathbb{R}^d$ to $\mathbb{R}^d$. Assume that 
\[ \left| \frac{\partial f^i(x,y)}{\partial y^j} \right| \le M_{ij} \quad \text{on } [a,b] \times \mathbb{R}^d, \quad i, j = 1, 2, \ldots, d, \]
where $M_{ij}$ are constants independent of x and y, and let $M = [M_{ij}] \in \mathbb{R}_+^{d \times d}$. 
Determine a Lipschitz constant L of f: 

(a) in the $\ell_1$ vector norm; 
(b) in the $\ell_2$ vector norm; 
(c) in the $\ell_\infty$ vector norm. 

Express L, if possible, in terms of a matrix norm of M. 

5. (a) Write the system of differential equations 


\[ u''' = x^2 u u'' - u v', \qquad v'' = x v v' + 4u' \]


as a first-order system of differential equations, y' = f(x, y). 
(b) Determine the Jacobian matrix $f_y(x, y)$ for the system in (a). 
(c) Determine a Lipschitz constant L for f on $[0,1] \times D$, where 
$D = \{y \in \mathbb{R}^d : \|y\|_1 \le 1\}$, using, respectively, the $\ell_1$, $\ell_2$, and $\ell_\infty$ norm (cf. Ex. 4). 
6. For the (scalar) differential equation 
\[ \frac{dy}{dx} = y^\lambda, \quad \lambda > 0, \]

(a) determine the principal error function of the general explicit two-stage 
Runge-Kutta method (5.56), (5.57); 

(b) compare the local accuracy of the modified Euler method with that of 
Heun's method; 

(c) determine a $\lambda$-interval such that for each $\lambda$ in this interval, there is a 
two-stage explicit Runge-Kutta method of order p = 3 having parameters 
$0 < \alpha_1 < 1$, $0 < \alpha_2 < 1$, and $0 < \mu < 1$. 


7. For the implicit Euler method 
\[ y_{\text{next}} = y + h f(x + h, y_{\text{next}}), \]

(a) state a condition under which $y_{\text{next}}$ is uniquely defined; 
(b) determine the order and principal error function. 


8. Show that any explicit two-stage Runge-Kutta method of order p = 2 
integrates the special scalar differential equation dy/dx = f(x), $f \in \mathbb{P}_1$, 
exactly. 


9. The (scalar) second-order differential equation 
\[ \frac{d^2 z}{dx^2} = g(x, z), \]
in which g does not depend on dz/dx, can be written as a first-order system 
\[ \frac{d}{dx} \begin{bmatrix} y^1 \\ y^2 \end{bmatrix} = \begin{bmatrix} y^2 \\ g(x, y^1) \end{bmatrix} \]
by letting, as usual, $y^1 = z$, $y^2 = dz/dx$. For this system, consider a one-step 
method $u_{n+1} = u_n + h\Phi(x_n, u_n; h)$ with 
\[ \Phi(x,y;h) = \begin{bmatrix} y^2 + \frac{1}{2} h k(x,y;h) \\ k(x,y;h) \end{bmatrix}, \quad k = g(x + \mu h, y^1 + \mu h y^2), \quad y = \begin{bmatrix} y^1 \\ y^2 \end{bmatrix}. \]
(Note that this method requires only one evaluation of g per step.) 

(a) Can the method be made to have order p = 2, and if so, for what value(s) 
of $\mu$? 

(b) Determine the principal error function of any method obtained in (a). 


10. Show that the first condition in (5.67) is equivalent to the condition that 
\[ k_s(x, y; h) = u'(x + \mu_s h) + O(h^2), \quad s \ge 2, \]
where u(t) is the reference solution through the point (x, y). 


11. Suppose that 
\[ \int_x^{x+h} z(t)\,dt = h \sum_{k=1}^{\nu} w_k z(x + \vartheta_k h) + c h^{\mu+1} z^{(\mu)}(\xi) \]
is a quadrature formula with $w_k \in \mathbb{R}$, $\vartheta_k \in [0, 1]$, $c \ne 0$, and $\xi \in (x, x+h)$, for 
z sufficiently smooth. Given increment functions $\Phi_k(x, y; h)$ defining methods 
of order $p_k$, $k = 1, 2, \ldots, \nu$, show that the one-step method defined by 
\[ \Phi(x, y; h) = \sum_{k=1}^{\nu} w_k f(x + \vartheta_k h,\; y + \vartheta_k h \Phi_k(x, y; \vartheta_k h)) \]
has order p at least equal to $\min(\mu, p + 1)$, where $p = \min_k p_k$. 


12. Let $g(x,y) = (f_x + f_y f)(x, y)$. Show that the one-step method defined by 
the increment function 
\[ \Phi(x, y; h) = f(x,y) + \tfrac{1}{2} h\, g\big(x + \tfrac{1}{3}h,\; y + \tfrac{1}{3} h f(x, y)\big) \]
has order p = 3. Express the principal error function in terms of g and its 
derivatives. 


13. Let f(x, y) satisfy a Lipschitz condition in y on $[a,b] \times \mathbb{R}^d$, with Lipschitz 
constant L. 

(a) Show that the increment function $\Phi$ of the second-order Runge-Kutta 
method 
\[ k_1 = f(x,y), \qquad k_2 = f(x+h, y + h k_1), \qquad \Phi(x, y; h) = \tfrac{1}{2}(k_1 + k_2) \]
also satisfies a Lipschitz condition whenever $x + h \in [a,b]$, and determine 
a respective Lipschitz constant M. 

(b) What would the result be for the classical Runge-Kutta method? 

(c) What would it be for the general implicit Runge-Kutta method? 


14. Describe the application of Newton's method to implement the implicit 
Runge-Kutta method. 


15. Consider the following scheme of constructing an estimator r(x, y; h) for the 
principal error function $\tau(x, y)$ of Heun's method: 
\[ \begin{aligned}
k_1 &= f(x, y), \\
k_2 &= f(x + h, y + h k_1), \\
y_h &= y + \tfrac{1}{2} h (k_1 + k_2), \\
k_3 &= f(x + h, y_h), \\
k_4 &= f(x + h + \mu h, y_h + \mu h k_3), \\
r(x, y; h) &= h^{-2}(\beta_1 k_1 + \beta_2 k_2 + \beta_3 k_3 + \beta_4 k_4).
\end{aligned} \]
(Note that this scheme requires one additional function evaluation, $k_4$, beyond 
what would be required anyhow to carry out Heun's method.) Obtain the 
conditions on the parameters $\mu, \beta_1, \beta_2, \beta_3, \beta_4$ in order that 
\[ r(x, y; h) = \tau(x, y) + O(h). \]
Show, in particular, that there is a unique set of $\beta$s for any $\mu$ with $\mu(\mu + 1) \ne 0$. 
What is a good choice of the parameters, and why? 


16. Apply the asymptotic error formula (5.104) to the (scalar) initial value problem 
$dy/dx = \lambda y$, y(0) = 1, on [0, 1], when solved by the classical fourth-order 
Runge-Kutta method. In particular, determine 


where $u_N$ is the Runge-Kutta approximation to y(1) obtained with step 
h = 1/N. 


17. Consider $y' = \lambda y$ on $[0, \infty)$ for complex $\lambda$ with $\mathrm{Re}\,\lambda < 0$. Let $\{u_n\}$ be 
the approximations to $\{y(x_n)\}$ obtained by the classical fourth-order 
Runge-Kutta method with the step h held fixed. (That is, $x_n = nh$, h > 0, and 
n = 0, 1, 2, ....) 

(a) Show that $y(x) \to 0$ as $x \to \infty$, for any initial value $y_0$. 

(b) Under what condition on h can we assert that $u_n \to 0$ as $n \to \infty$? In 
particular, what is the condition if $\lambda$ is real (negative)? 

(c) What is the analogous result for Euler's method? 

(d) Generalize to systems y' = Ay, where A is a constant matrix all of whose 
eigenvalues have negative real parts. 


18. Show that any one-step method of order p, which, when applied to the model 
problem y' = Ay, yields 
\[ y_{\text{next}} = \varphi(hA) y, \quad \varphi \text{ a polynomial of degree } q \ge p, \]
must have 
\[ \varphi(z) = 1 + z + \frac{1}{2!} z^2 + \cdots + \frac{1}{p!} z^p + z^{p+1} \chi(z), \]
where $\chi$ is identically 0 if q = p, and a polynomial of degree q − p − 1 
otherwise. In particular, show that $\chi \equiv 0$ for a p-stage explicit Runge-Kutta 
method of order p, $1 \le p \le 4$, and for the Taylor expansion method of order 
$p \ge 1$. 

19. Consider the linear homogeneous system 
\[ (*) \qquad y' = Ay, \quad y \in \mathbb{R}^d, \]
with constant coefficient matrix $A \in \mathbb{R}^{d \times d}$. 

(a) For Euler's method applied to (*), determine $\varphi(z)$ (cf. (5.140)) and the 
principal error function. 
(b) Do the same for the classical fourth-order Runge-Kutta method. 


20. Consider the model equation 
\[ \frac{dy}{dx} = a(x)[y - b(x)], \quad 0 \le x < \infty, \]
where a(x), b(x) are continuous and bounded on $\mathbb{R}_+$, and a(x) negative with 
|a(x)| large, say, 
\[ a \le |a(x)| \le A \ \text{ on } \mathbb{R}_+, \quad a \gg 1. \]
For the explicit and implicit Euler methods, derive a condition (if any) on the 
step length h that ensures boundedness of the respective approximations $u_n$ as 
$x_n = nh \to \infty$ for h > 0 fixed. (Assume, in the case of the explicit Euler 
method, that a is so large that ah > 1.) 
21. Consider the implicit one-step method 
\[ \Phi(x, y; h) = k(x, y; h), \]
where $k : [a,b] \times \mathbb{R}^d \times (0, h_0] \to \mathbb{R}^d$ is implicitly defined, in terms of total 
derivatives of f, by 
\[ k = \sum_{s=1}^{r} h^{s-1} \left[ \alpha_s f^{[s-1]}(x, y) - \beta_s f^{[s-1]}(x + h, y + hk) \right], \]
with suitable constants $\alpha_s$ and $\beta_s$ (Ehle's method; cf. Sect. 5.9.3(4)). 

(a) Show how the method works on the model problem $dy/dx = \lambda y$. What 
is the maximum possible order in this case? Is the resulting method (of 
maximal order) A-stable? 

(b) We may associate with the one-step method the quadrature rule 
\[ \int_x^{x+h} g(t)\,dt = \sum_{s=1}^{r} h^s \left[ \alpha_s g^{(s-1)}(x) - \beta_s g^{(s-1)}(x + h) \right] + E(g). \]
Given any p with $r \le p \le 2r$, show that $\alpha_s$, $\beta_s$ can be chosen so as to have 
$E(g) = O(h^{p+1})$ when $g(t) = e^{t-x}$. 

(c) With $\alpha_s$, $\beta_s$ chosen as in (b), prove that $E(g) = O(h^{p+1})$ for any 
$g \in C^p$ (not just for $g(t) = e^{t-x}$). {Hint: expand E(g) in powers of h 
through $h^p$ inclusive; then specialize to $g(t) = e^{t-x}$ and draw appropriate 
conclusions.} 

(d) With $\alpha_s$, $\beta_s$ chosen as in (b), show that the implicit one-step method has 
order p if $f \in C^p$. {Hint: use the definition of truncation error and 
Lipschitz conditions on the total derivatives $f^{[s-1]}$.} 

(e) Work out the optimal one-step method with r = 2 and order p = 4. 

(f) How can you make the method L-stable (cf. (5.170)) and have maximum 
possible order? Illustrate with r = 2. 


Machine Assignments 


1. (a) Write Matlab routines implementing the basic step $(x, y) \mapsto (x + h, y_{\text{next}})$ 
in the case of Euler's method and the classical fourth-order Runge-Kutta 
method, entering the function f of the differential equation y' = f(x, y) 
as an input function. 

(b) Consider the initial value problem 
\[ y' = Ay, \quad 0 \le x \le 1, \quad y(0) = \mathbf{1}, \]
where 
\[ A = \frac{1}{2} \begin{bmatrix} \lambda_2 + \lambda_3 & \lambda_3 - \lambda_1 & \lambda_2 - \lambda_1 \\ \lambda_3 - \lambda_2 & \lambda_1 + \lambda_3 & \lambda_1 - \lambda_2 \\ \lambda_2 - \lambda_3 & \lambda_1 - \lambda_3 & \lambda_1 + \lambda_2 \end{bmatrix}, \qquad \mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}. \]
The exact solution is 
\[ y(x) = \begin{bmatrix} y^1 \\ y^2 \\ y^3 \end{bmatrix}, \qquad \begin{aligned} y^1 &= -e^{\lambda_1 x} + e^{\lambda_2 x} + e^{\lambda_3 x}, \\ y^2 &= e^{\lambda_1 x} - e^{\lambda_2 x} + e^{\lambda_3 x}, \\ y^3 &= e^{\lambda_1 x} + e^{\lambda_2 x} - e^{\lambda_3 x}. \end{aligned} \]
Integrate the initial value problem with constant step length h = 1/N by 

(i) Euler's method (order p = 1); 
(ii) the classical Runge-Kutta method (order p = 4), 

using the programs written in (a). In each case, along with the approximation 
vectors $u_n \in \mathbb{R}^3$, n = 1, 2, ..., N, generate vectors $v_n \in \mathbb{R}^3$, 
n = 1, 2, ..., N, that approximate the solution of the variational equation 
according to Theorem 5.8.1. (For the estimate r(x, y; h) of the principal 
error function take the true value $r(x, y; h) = \tau(x, y)$ according to Ex. 19.) 
In this way obtain estimates $\tilde{e}_n = h^p v_n$ (p = order of the method) of the 
global errors $e_n = u_n - y(x_n)$. Use N = 5, 10, 20, 40, 80, and print $x_n$, 
$\|e_n\|_\infty$, and $\|\tilde{e}_n\|_\infty$ for $x_n$ = .2:.2:1. 
Suggested $\lambda$-values are 

(i) $\lambda_1 = -1$, $\lambda_2 = 0$, $\lambda_3 = 1$; 
(ii) $\lambda_1 = 0$, $\lambda_2 = -1$, $\lambda_3 = -10$; 
(iii) $\lambda_1 = 0$, $\lambda_2 = -1$, $\lambda_3 = -40$; 
(iv) $\lambda_1 = 0$, $\lambda_2 = -1$, $\lambda_3 = -160$. 

Summarize what you learn from these examples and from others that you 
may wish to run. 


2. Consider the initial value problem 
\[ y'' = \cos(xy), \quad y(0) = 1, \quad y'(0) = 0, \quad 0 \le x \le 1. \]

(a) Does the solution y(x) exist on the whole interval $0 \le x \le 1$? Explain. 

(b) Use a computer algebra system, for example Maple, to determine the 
Maclaurin expansion of the solution y(x) up to, and including, the term with 
$x^{50}$. Evaluate the expansion to 15 decimal digits for x = 0.25 : 0.25 : 1.0. 

(c) Describe in detail the generic step of the classical fourth-order Runge-Kutta 
method applied to this problem. 

(d) Use the fourth-order Runge-Kutta routine RK4.m of MA1(a), in conjunction 
with a function fMAV_2.m appropriate for this assignment, to produce 
approximations $u_n \approx y(x_n)$ at $x_n = n/N$, n = 0, 1, 2, ..., N, for 
N = [4, 16, 64, 256]. Print the results $y(x_n)$, $y'(x_n)$ to 12 decimal places 
for $x_n$ = .25 : .25 : 1.0, including the errors $e_n = |u_n^1 - y(x_n)|$. (Use the 
Taylor expansion of (b) to compute $y(x_n)$.) Plot the solution y, y' obtained 
with N = 256. 


3. On the interval $[2\sqrt{q}, 2\sqrt{q} + 1]$, $q \ge 0$ an integer, consider the initial value 
problem (Fehlberg, 1968) 
\[ \frac{d^2 c}{dx^2} = -\pi^2 x^2 c - \pi \frac{s}{\sqrt{c^2 + s^2}}, \qquad \frac{d^2 s}{dx^2} = -\pi^2 x^2 s + \pi \frac{c}{\sqrt{c^2 + s^2}}, \]
with initial conditions at $x = 2\sqrt{q}$ given by 
\[ c = 1, \quad \frac{dc}{dx} = 0, \quad s = 0, \quad \frac{ds}{dx} = 2\pi\sqrt{q}. \]

(a) Show that the exact solution is 
\[ c(x) = \cos\left(\frac{\pi}{2} x^2\right), \qquad s(x) = \sin\left(\frac{\pi}{2} x^2\right). \]
(b) Write the problem as an initial value problem for a system of first-order 
differential equations. 
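(c) Consider the Runge-Kutta-Fehlberg (3, 4) pair $\Phi$, $\Phi^*$ given by 
\[ \begin{aligned}
k_1 &= f(x, y), \\
k_2 &= f\left(x + \tfrac{2}{7} h,\; y + \tfrac{2}{7} h k_1\right), \\
k_3 &= f\left(x + \tfrac{7}{15} h,\; y + \tfrac{77}{900} h k_1 + \tfrac{343}{900} h k_2\right), \\
k_4 &= f\left(x + \tfrac{35}{38} h,\; y + \tfrac{805}{1444} h k_1 - \tfrac{77175}{54872} h k_2 + \tfrac{97125}{54872} h k_3\right), \\
\Phi(x, y; h) &= \tfrac{79}{490} k_1 + \tfrac{2175}{3626} k_3 + \tfrac{2166}{9065} k_4,
\end{aligned} \]
respectively 
\[ \begin{aligned}
& k_1, k_2, k_3, k_4 \text{ as previously}, \\
k_5 &= f(x + h,\; y + h\Phi(x, y; h)), \\
\Phi^*(x, y; h) &= \tfrac{229}{1470} k_1 + \tfrac{1125}{1813} k_3 + \tfrac{13718}{81585} k_4 + \tfrac{1}{18} k_5.
\end{aligned} \]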


Solve the initial value problem in (b) for q = 0(1)3 by the method $\Phi$, using 
constant step length h = 0.2. Repeat the integration with half the step length, 
and keep repeating (and halving the step) until $\max_n \|u_n - y(x_n)\|_\infty \le .5 \times 10^{-6}$, 
where $u_n$, $y(x_n)$ are the approximate resp. exact solution vectors 
at $x_n = 2\sqrt{q} + nh$. For each run print 
\[ q, \quad h, \quad \max_n \|u_n - y(x_n)\|_\infty, \quad \max_n |c_n^2 + s_n^2 - 1|, \]
where $c_n$, $s_n$ are the approximate values obtained for $c(x_n)$ resp. $s(x_n)$, and 
the maxima are taken over n = 1, 2, ..., N with N such that Nh = 1. 

(d) For the same values of q as in (c) and h = 0.2, 0.1, 0.05, 0.025, 0.0125, print 
the global and (estimated) local errors, 
\[ q, \quad h, \quad \|u_{n+1} - y(x_{n+1})\|_\infty, \quad h\|\Phi(x_n, u_n; h) - \Phi^*(x_n, u_n; h)\|_\infty, \]
for $x_n$ = 0(.2).8. 

(e) Implement Theorem 5.8.1 on global error estimation, using the 
Runge-Kutta-Fehlberg (3,4) method $\Phi$, $\Phi^*$ of (c) and the estimator 
$r(x, y; h) = h^{-3}[\Phi(x, y; h) - \Phi^*(x, y; h)]$ of the principal error function of $\Phi$. For the 
same values of q as in (d), and for h = 0.05, 0.025, 0.0125, print the exact 
and estimated global errors, 
\[ q, \quad h, \quad \|u_{n+1} - y(x_{n+1})\|_\infty, \quad h^3 \|v_{n+1}\|_\infty \quad \text{for } x_n = 0:.2:.8. \]


4. (a) Let $f(z) = 1 + z + z^2/2! + \cdots + z^p/p!$. For p = 1(1)4 write a Matlab program, 
using the contour command, to plot the lines along which $|f(z)| = r$, 
r = 0.1(0.1)1 (level lines of f) and the lines along which $\arg f(z) = \theta$, 
$\theta = 0(\frac{1}{8}\pi)2\pi - \frac{1}{8}\pi$ (phase lines of f). 

(b) For any analytic function f, derive differential equations for the level and 
phase lines of f. {Hint: write $f(z) = r \exp(i\theta)$ and use $\theta$ as the independent 
variable for the level lines, and r as the independent variable for the phase 
lines. In each case, introduce arc length as the final independent variable.} 

(c) Use the Matlab function ode45 to compute from the differential equation of 
(b) the level lines |f(z)| = 1 of the function f given in (a), for p = 1(1)21; 
these determine the regions of absolute stability of the Taylor expansion 
method (cf. Ex. 18). {Hint: use initial conditions at the origin. Produce only 
those parts of the curves that lie in the upper half-plane (why?). To do so in 
Matlab, let ode45 run sufficiently long, interpolate between the first pair of 
points lying on opposite sides of the real axis to get a point on the axis, and 
then delete the rest of the data before plotting.} 

5. Newton's equations for the motion of a particle on a planar orbit (with 
eccentricity $\varepsilon$, $0 < \varepsilon < 1$) are 
\[ \begin{aligned}
x'' &= -\frac{x}{r^3}, & x(0) &= 1 - \varepsilon, & x'(0) &= 0, \\
y'' &= -\frac{y}{r^3}, & y(0) &= 0, & y'(0) &= \sqrt{\frac{1 + \varepsilon}{1 - \varepsilon}},
\end{aligned} \qquad t \ge 0, \]
where 
\[ r^2 = x^2 + y^2. \]

(a) Verify that the solution can be written in the form $x(t) = \cos u - \varepsilon$, 
$y(t) = \sqrt{1 - \varepsilon^2} \sin u$, where u = u(t) is the solution of Kepler's equation 
$u - \varepsilon \sin u - t = 0$. 

(b) Reformulate the problem as an initial value problem for a system of 
first-order differential equations. 

(c) Write a Matlab program for solving the initial value problem in (b) on the 
interval [0, 20] by the classical Runge-Kutta method, for $\varepsilon$ = 0.3, 0.5, and 0.7. 
Use step lengths h = 1/N, N = [40, 80, 120] and, along with the approximate 
solution u(t; h), v(t; h), compute and plot the approximate principal 
error functions $r(t; h) = h^{-4}[u(t; h) - x(t)]$, $s(t; h) = h^{-4}[v(t; h) - y(t)]$ 
when N = 120 (i.e., h = .008333...). Compute the exact solution 
from the formula given in (a), using Newton's method to solve Kepler's 
equation. 
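
Before coding the Runge-Kutta-Fehlberg (3, 4) pair of Machine Assignment 3(c), it can be reassuring to confirm that the coefficients have been transcribed correctly. The Matlab sketch below checks the row sums of the array and the order conditions up to order 3 for the weights of Φ and up to order 4 for those of Φ*. The array names (c, a, b, bs) and the residual vectors are illustrative, not part of the assignment.

```matlab
% Sketch: consistency and order checks for the RKF(3,4) coefficients.
c = [0; 2/7; 7/15; 35/38; 1];                  % nodes
a = zeros(5,5);
a(2,1)   = 2/7;
a(3,1:2) = [77/900, 343/900];
a(4,1:3) = [805/1444, -77175/54872, 97125/54872];
b  = [79/490; 0; 2175/3626; 2166/9065; 0];     % weights of Phi
a(5,1:4) = b(1:4)';                            % k5 = f(x+h, y+h*Phi)
bs = [229/1470; 0; 1125/1813; 13718/81585; 1/18];  % weights of Phi*

% each row of a must sum to the corresponding node
rowsum = max(abs(sum(a,2) - c));

% residuals of the order conditions through order 3 for Phi
r3 = [sum(b) - 1; b'*c - 1/2; b'*c.^2 - 1/3; b'*(a*c) - 1/6];

% residuals of the order conditions through order 4 for Phi*
r4 = [sum(bs) - 1; bs'*c - 1/2; bs'*c.^2 - 1/3; bs'*(a*c) - 1/6; ...
      bs'*c.^3 - 1/4; (bs.*c)'*(a*c) - 1/8; ...
      bs'*(a*c.^2) - 1/12; bs'*(a*(a*c)) - 1/24];

fprintf('row sums:       %.2e\n', rowsum);
fprintf('Phi,  order 3:  %.2e\n', max(abs(r3)));
fprintf('Phi*, order 4:  %.2e\n', max(abs(r4)));
```

All three printed numbers should be at rounding level; a mistyped coefficient typically shows up as a residual many orders of magnitude larger.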
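Selected Solutions to Exercises 


15. We know from (5.64) (where $\alpha_2 = 1/2$) that 
\[ \tau(x, y) = \frac{1}{2} \left[ \frac{1}{6} (f_{xx} + 2 f_{xy} f + f^T f_{yy} f) - \frac{1}{3} f_y (f_x + f_y f) \right], \]
and from (5.59) (where $\mu = 1$) that 
\[ k_2 = f + h(f_x + f_y f) + \tfrac{1}{2} h^2 (f_{xx} + 2 f_{xy} f + f^T f_{yy} f) + O(h^3). \]
For $k_3$ and $k_4$, one finds similarly that 
\[ \begin{aligned}
k_3 &= f + h(f_x + f_y f) + \tfrac{1}{2} h^2 \left[ f_{xx} + 2 f_{xy} f + f^T f_{yy} f + f_y (f_x + f_y f) \right] + O(h^3), \\
k_4 &= f + (\mu + 1) h (f_x + f_y f) + \tfrac{1}{2} h^2 \left[ (\mu + 1)^2 (f_{xx} + 2 f_{xy} f + f^T f_{yy} f) + (2\mu + 1) f_y (f_x + f_y f) \right] + O(h^3).
\end{aligned} \]
For $h^{-2}(\beta_1 k_1 + \beta_2 k_2 + \beta_3 k_3 + \beta_4 k_4)$ to approximate $\tau(x, y)$ to within O(h), 
that is, 
\[ \begin{aligned}
& h^{-2} (\beta_1 + \beta_2 + \beta_3 + \beta_4) f + h^{-1} [\beta_2 + \beta_3 + (\mu + 1)\beta_4] (f_x + f_y f) \\
& \quad + \tfrac{1}{2} \left[ (\beta_2 + \beta_3 + (\mu + 1)^2 \beta_4)(f_{xx} + 2 f_{xy} f + f^T f_{yy} f) + (\beta_3 + (2\mu + 1)\beta_4) f_y (f_x + f_y f) \right] + O(h) \\
& \quad = \tau(x, y) + O(h),
\end{aligned} \]
requires that 
\[ \begin{aligned}
\beta_1 + \beta_2 + \beta_3 + \beta_4 &= 0, \\
\beta_2 + \beta_3 + (\mu + 1)\beta_4 &= 0, \\
\beta_2 + \beta_3 + (\mu + 1)^2 \beta_4 &= \tfrac{1}{6}, \\
\beta_3 + (2\mu + 1)\beta_4 &= -\tfrac{1}{3}.
\end{aligned} \]
The determinant of this system is 
\[ \begin{vmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & \mu + 1 \\ 0 & 1 & 1 & (\mu + 1)^2 \\ 0 & 0 & 1 & 2\mu + 1 \end{vmatrix} = \begin{vmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & \mu + 1 \\ 0 & 0 & 0 & \mu(\mu + 1) \\ 0 & 0 & 1 & 2\mu + 1 \end{vmatrix} = -\mu(\mu + 1), \]
so that the condition $\mu(\mu + 1) \ne 0$ ensures a unique solution. A good solution is 
the one corresponding to $\mu = 1$, since then no extra evaluation of f is needed, 
$k_3$ and $k_4$ being the evaluations needed anyhow to execute the second step of 
Heun's method. In this case, we get 
\[ \mu = 1, \quad \beta_1 = \tfrac{1}{12}, \quad \beta_2 = \tfrac{5}{12}, \quad \beta_3 = -\tfrac{7}{12}, \quad \beta_4 = \tfrac{1}{12}. \]
(For literature, see L.F. Shampine and H.A. Watts, Computing error estimates 
for Runge-Kutta methods, Math. Comp. 25 (1971), 445-455.) 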
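
The linear system for the β's in the solution to Ex. 15 is small enough to be checked numerically. The following Matlab sketch solves it for μ = 1 and compares the determinant with −μ(μ + 1) for a few values of μ. The right-hand sides 1/6 and −1/3 are those obtained by matching coefficients with τ(x, y); the variable names are illustrative.

```matlab
% Sketch: solve the 4x4 system for beta_1,...,beta_4 (mu = 1) and
% check det = -mu*(mu+1) for several mu.
sysmat = @(mu) [1, 1, 1, 1;
                0, 1, 1, mu+1;
                0, 1, 1, (mu+1)^2;
                0, 0, 1, 2*mu+1];
rhs  = [0; 0; 1/6; -1/3];

beta = sysmat(1)\rhs;                 % coefficients for mu = 1
fprintf('12*beta = [%g %g %g %g]\n', 12*beta);

mus    = [0.5, 1, 2, -0.3];           % arbitrary test values
detErr = zeros(size(mus));
for j = 1:numel(mus)
    mu = mus(j);
    detErr(j) = abs(det(sysmat(mu)) + mu*(mu+1));
end
fprintf('largest determinant discrepancy: %.2e\n', max(detErr));
```

For μ = 0 or μ = −1 the matrix becomes singular, in agreement with the condition μ(μ + 1) ≠ 0.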
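21. (a) If $y' = \lambda y$, then $f(x, y) = \lambda y$, $f^{[s-1]}(x, y) = \lambda^s y$, and 
\[ k = \sum_{s=1}^{r} h^{s-1} \left[ \alpha_s \lambda^s y - \beta_s \lambda^s (y + hk) \right] = \frac{1}{h} \sum_{s=1}^{r} (\alpha_s - \beta_s)(\lambda h)^s y - \sum_{s=1}^{r} \beta_s (\lambda h)^s k. \]
Thus, 
\[ \left( 1 + \sum_{s=1}^{r} \beta_s (\lambda h)^s \right) k = \frac{1}{h} \sum_{s=1}^{r} (\alpha_s - \beta_s)(\lambda h)^s y. \]
Letting $z = \lambda h$, we get 
\[ k(x, y; h) = \frac{1}{h} \, \frac{\sum_{s=1}^{r} (\alpha_s - \beta_s) z^s}{1 + \sum_{s=1}^{r} \beta_s z^s} \, y. \]
There follows 
\[ y_{\text{next}} = y + hk(x, y; h) = \left( 1 + \frac{\sum_{s=1}^{r} (\alpha_s - \beta_s) z^s}{1 + \sum_{s=1}^{r} \beta_s z^s} \right) y = \frac{1 + \sum_{s=1}^{r} \alpha_s z^s}{1 + \sum_{s=1}^{r} \beta_s z^s} \, y, \]
that is, 
\[ \varphi(z) = \frac{1 + \sum_{s=1}^{r} \alpha_s z^s}{1 + \sum_{s=1}^{r} \beta_s z^s}. \]
The method has order p if and only if 
\[ \frac{1 + \sum_{s=1}^{r} \alpha_s z^s}{1 + \sum_{s=1}^{r} \beta_s z^s} = e^z + O(z^{p+1}), \quad z \to 0. \]
The maximum possible order is p = 2r and is obtained by choosing the $\alpha_s$ 
and $\beta_s$ so that $\varphi$ is the Padé approximant R[r, r] to the exponential function. 
The resulting method is A-stable by Theorem 5.9.2. 

(b) For $g(t) = e^{t-x}$, we have $g^{(s-1)}(t) = e^{t-x}$, so that the quadrature formula 
becomes 
\[ \int_x^{x+h} e^{t-x}\,dt = e^h - 1 = \sum_{s=1}^{r} h^s (\alpha_s - \beta_s e^h) + E(g). \]
Therefore, 
\[ \begin{aligned}
E(g) &= e^h - 1 - \sum_{s=1}^{r} \alpha_s h^s + e^h \sum_{s=1}^{r} \beta_s h^s \\
&= \left( 1 + \sum_{s=1}^{r} \beta_s h^s \right) e^h - \left( 1 + \sum_{s=1}^{r} \alpha_s h^s \right) \\
&= \left( 1 + \sum_{s=1}^{r} \beta_s h^s \right) \left\{ e^h - \frac{1 + \sum_{s=1}^{r} \alpha_s h^s}{1 + \sum_{s=1}^{r} \beta_s h^s} \right\}.
\end{aligned} \]
Given any p with $r \le p \le 2r$, we can always choose nonnegative 
integers $m \le r$ and $n \le r$ such that p = n + m. Setting 
$\alpha_{m+1} = \cdots = \alpha_r = 0$, $\beta_{n+1} = \cdots = \beta_r = 0$, and choosing the remaining 
parameters so that the rational function in braces becomes the R[n, m]-Padé 
approximant to the exponential function makes the expression in 
braces of order $O(h^{n+m+1}) = O(h^{p+1})$, that is, $E(g) = O(h^{p+1})$. 
Clearly, p = 2r is optimal, in which case m = n = r, and we get uniquely 
the R[r, r]-Padé approximant. If p < 2r, there is more than one choice 
for m and n. 

(c) Assuming $g \in C^p$, we have the expansions 
\[ \begin{aligned}
\int_x^{x+h} g(t)\,dt &= h g(x) + \frac{h^2}{2!} g'(x) + \cdots + \frac{h^p}{p!} g^{(p-1)}(x) + O(h^{p+1}), \\
h^s g^{(s-1)}(x + h) &= h^s g^{(s-1)}(x) + \frac{h^{s+1}}{1!} g^{(s)}(x) + \cdots + \frac{h^p}{(p-s)!} g^{(p-1)}(x) + O(h^{p+1}).
\end{aligned} \]
Substitution into the quadrature formula yields an expansion of the type 
\[ E(g) = e_1 h g(x) + e_2 h^2 g'(x) + \cdots + e_p h^p g^{(p-1)}(x) + O(h^{p+1}), \]
with certain constants $e_i$ independent of h and g. Now put $g(t) = e^{t-x}$. 
Since $g^{(i)}(x) = 1$ for all $i \ge 0$, we get 
\[ E(e^{t-x}) = e_1 h + e_2 h^2 + \cdots + e_p h^p + O(h^{p+1}). \]
Since the quadrature rule has been chosen such that $E(e^{t-x}) = O(h^{p+1})$ 
(cf. (b)), it follows that, necessarily, $e_1 = e_2 = \cdots = e_p = 0$. Hence, 
$E(g) = O(h^{p+1})$ for any $g \in C^p[x, x + h]$. 

(d) We have for the truncation error of the method, 
\[ T(x, y; h) = k(x, y; h) - \frac{1}{h} [u(x + h) - u(x)], \]
with u(t) the reference solution through (x, y). Since by assumption $f \in C^p$, 
we have $u \in C^{p+1}$. Put g(t) = u'(t) in the quadrature formula, divide 
by h, and use (c) to get 
\[ \frac{1}{h} [u(x + h) - u(x)] = \sum_{s=1}^{r} h^{s-1} \left[ \alpha_s u^{(s)}(x) - \beta_s u^{(s)}(x + h) \right] + O(h^p). \]
Therefore, since $f^{[s-1]}(x, y) = u^{(s)}(x)$ (cf. (5.47)), we obtain 
\[ \begin{aligned}
T(x, y; h) &= \sum_{s=1}^{r} h^{s-1} \left[ \alpha_s u^{(s)}(x) - \beta_s f^{[s-1]}(x + h, y + hk) \right] \\
&\quad - \sum_{s=1}^{r} h^{s-1} \left[ \alpha_s u^{(s)}(x) - \beta_s u^{(s)}(x + h) \right] + O(h^p) \\
&= -\sum_{s=1}^{r} h^{s-1} \beta_s \left[ f^{[s-1]}(x + h, y + hk) - f^{[s-1]}(x + h, u(x + h)) \right] + O(h^p),
\end{aligned} \]
where we have used $u^{(s)}(t) = f^{[s-1]}(t, u(t))$ (cf. (5.46)). With $L_{s-1}$ 
denoting a Lipschitz constant for $f^{[s-1]}$, the expression in brackets is 
bounded in norm by 
\[ L_{s-1} \|y + hk(x, y; h) - u(x + h)\| = h L_{s-1} \left\| k(x, y; h) - \frac{1}{h} [u(x + h) - u(x)] \right\| = h L_{s-1} \|T(x, y; h)\|, \]
since u(x) = y. There follows 
\[ \|T(x, y; h)\| \le \left( \sum_{s=1}^{r} h^s |\beta_s| L_{s-1} \right) \|T(x, y; h)\| + O(h^p), \]
that is, 
\[ [1 - O(h)] \, \|T(x, y; h)\| = O(h^p), \qquad \|T(x, y; h)\| = O(h^p). \]

(e) The [2,2]-Padé approximant to the exponential function is (cf. (5.149), 
(5.150)) 
\[ R[2, 2](z) = \frac{1 + \frac{1}{2} z + \frac{1}{12} z^2}{1 - \frac{1}{2} z + \frac{1}{12} z^2}. \]
Therefore, according to (b), 
\[ \alpha_1 = \tfrac{1}{2}, \quad \alpha_2 = \tfrac{1}{12}; \qquad \beta_1 = -\tfrac{1}{2}, \quad \beta_2 = \tfrac{1}{12}. \]
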
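(f) Choose for $\varphi(z)$ the Padé approximant R[r, r − 1](z) by setting $\alpha_r = 0$. The 
(maximum) order is then p = 2r − 1. For r = 2, this becomes 
\[ R[2, 1](z) = \frac{1 + \frac{1}{3} z}{1 - \frac{2}{3} z + \frac{1}{6} z^2}, \]
so that 
\[ \alpha_1 = \tfrac{1}{3}, \quad \alpha_2 = 0; \qquad \beta_1 = -\tfrac{2}{3}, \quad \beta_2 = \tfrac{1}{6}. \]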
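
The Padé coefficients appearing in parts (e) and (f) of the solution to Ex. 21 can be checked by comparing power series. The Matlab sketch below multiplies each denominator by the Taylor series of the exponential and subtracts the numerator, so that the first nonvanishing coefficient reveals the order; it also samples both functions on the imaginary axis. The coefficients for R[2, 1] are entered as an assumption consistent with setting the second numerator coefficient to zero; all names are illustrative.

```matlab
% Sketch: order and stability checks for R[2,2] and R[2,1].
% Coefficient vectors are in ascending powers of z.
num22 = [1  1/2 1/12];   den22 = [1 -1/2 1/12];   % R[2,2]
num21 = [1  1/3 0];      den21 = [1 -2/3 1/6];    % R[2,1] (assumed values)
texp  = 1./factorial(0:5);                         % Taylor coefficients of e^z

% coefficients of den(z)*e^z - num(z) for powers z^0,...,z^5
d22 = conv(den22, texp);  d22 = d22(1:6) - [num22 0 0 0];
d21 = conv(den21, texp);  d21 = d21(1:6) - [num21 0 0 0];

% the same functions, evaluated pointwise
R22 = @(z) (1 + z/2 + z.^2/12)./(1 - z/2 + z.^2/12);
R21 = @(z) (1 + z/3)./(1 - 2*z/3 + z.^2/6);
y   = linspace(-50, 50, 2001);
m22 = max(abs(R22(1i*y)));     % modulus on the imaginary axis
m21 = max(abs(R21(1i*y)));
atInf = abs(R21(-1e8));        % behavior far out on the negative real axis

fprintf('R[2,2]: residuals z^0..z^4 = %.1e, z^5 coefficient = %.6f\n', ...
        max(abs(d22(1:5))), d22(6));
fprintf('R[2,1]: residuals z^0..z^3 = %.1e, z^4 coefficient = %.6f\n', ...
        max(abs(d21(1:4))), d21(5));
fprintf('max modulus on imaginary axis: %.12f (R22), %.12f (R21)\n', m22, m21);
```

The vanishing residuals through z^4 and z^3, respectively, correspond to orders 4 and 3, and the decay of R[2, 1] along the negative real axis is the property exploited for L-stability.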


5 Initial Value Problems for ODEs: One-Step Methods 


Selected Solutions to Machine Assignments 


2. (a) Write y' = z, so that the initial value problem can be written in vector form 
as 
\[ \frac{d}{dx} \begin{bmatrix} y \\ z \end{bmatrix} = \begin{bmatrix} z \\ \cos(xy) \end{bmatrix}, \qquad \begin{bmatrix} y \\ z \end{bmatrix}(0) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad 0 \le x \le 1. \]
We claim that the right-hand side of the system satisfies a uniform Lipschitz 
condition on $[0, 1] \times \mathbb{R}^2$ in the $L_1$-norm, with Lipschitz constant L = 1. 
Indeed, 
\[ \left\| \begin{bmatrix} z - z^* \\ \cos(xy) - \cos(xy^*) \end{bmatrix} \right\|_1 = \left\| \begin{bmatrix} z - z^* \\ -2 \sin\left(\frac{1}{2} x (y + y^*)\right) \sin\left(\frac{1}{2} x (y - y^*)\right) \end{bmatrix} \right\|_1 \le |z - z^*| + |y - y^*| = \left\| \begin{bmatrix} y \\ z \end{bmatrix} - \begin{bmatrix} y^* \\ z^* \end{bmatrix} \right\|_1, \]
as claimed. By Theorem 5.3.1, therefore, the initial value problem has a 
unique solution, which exists on all of [0, 1]. 

(b) The following Maple program produces the Taylor expansion of the solution 
at x = 0 to 50 terms and evaluates it for x = .25 : .25 : 1 to 15 digits. 


eq:=diff(y(x),x,x)-cos(x*y(x))=0; 
ini:=y(0)=1, D(y)(0)=0; 
Order:=50; 
sol:=dsolve({eq,ini},{y(x)},type=series); 
p:=convert(sol,polynom); 
Digits:=15; 
for x from .25 to 1 by .25 do p(x) od; 


The result 
\[ \begin{aligned}
y(.25) &= 1.03108351021039, \\
y(.50) &= 1.12215798572506, \\
y(.75) &= 1.26540089160147, \\
y(1.0) &= 1.44401698100709,
\end{aligned} \]
confirmed by another Maple run with 60 terms, is used to determine the error 
in the Matlab routine MAV_2D below. 

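(c) The classical fourth-order Runge-Kutta method, for the first-order system, 
emanating at the generic point $\left(x, \begin{bmatrix} y \\ z \end{bmatrix}\right)$, proceeds as follows: 
\[ \begin{aligned}
k_1 &= \begin{bmatrix} z \\ \cos(xy) \end{bmatrix}, \\
k_2 &= \begin{bmatrix} z + \frac{1}{2} h k_1^2 \\ \cos\left((x + \frac{1}{2} h)(y + \frac{1}{2} h k_1^1)\right) \end{bmatrix}, \\
k_3 &= \begin{bmatrix} z + \frac{1}{2} h k_2^2 \\ \cos\left((x + \frac{1}{2} h)(y + \frac{1}{2} h k_2^1)\right) \end{bmatrix}, \\
k_4 &= \begin{bmatrix} z + h k_3^2 \\ \cos\left((x + h)(y + h k_3^1)\right) \end{bmatrix}, \\
\begin{bmatrix} y \\ z \end{bmatrix}_{\text{next}} &= \begin{bmatrix} y \\ z \end{bmatrix} + \frac{h}{6} (k_1 + 2k_2 + 2k_3 + k_4),
\end{aligned} \]
where $k_i^1$, $k_i^2$ denote the first and second component of $k_i$. 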


(d) PROGRAM 

For the routine RK4.m, see the answer to MA 1(a). 

%MAV_2D 
% 
f0='%8.2f %16.12f %15.12f %12.4e N=%3.0f\n'; 
f1='%8.2f %16.12f %15.12f %12.4e\n'; 
exact=[1.03108351021039;1.12215798572506; 
  1.26540089160147;1.44401698100709]; 
disp(['      x          y              dy/dx' ... 
  '          error']) 
for N=[4 16 64 256] 
  h=1/N; 
  y=zeros(N+1,1); y1=zeros(N+1,1); 
  y(1)=1; y1(1)=0; u=[1;0]; 
  ip=0;                          % counter for the reference values 
  for n=1:N 
    x=(n-1)/N; 
    u=RK4(@fMAV_2,x,u,h); 
    y(n+1)=u(1); y1(n+1)=u(2); 
    if 4*n/N-fix(4*n/N)==0       % x_n is a multiple of .25 
      ip=ip+1; 
      yexact=exact(ip); err=abs(u(1)-yexact); 
      if n==N/4 
        fprintf(f0,n/N,u(1),u(2),err,N) 
      else 
        fprintf(f1,n/N,u(1),u(2),err) 
      end 
    end 
  end 
  fprintf('\n') 
  if N==256 
    x=(0:N)'/N; 
    hold on 
    plot(x,y); 
    plot(x,y1,'--'); 
    xlabel('x') 
    text(.5,1.2,'y','FontSize',14)      % label position only partly legible in the source 
    text(.5,.75,'dy/dx','FontSize',14)  % label position only partly legible in the source 
    hold off 
  end 
end 

%FMAV_2  Differential equation for MAV_2 
% 
function yprime=fMAV_2(x,y) 
yprime=[y(2);cos(x*y(1))]; 


OUTPUT 

>> MAV_2D 
    x          y              dy/dx           error 
  0.25   1.031084895175   0.247302726779   1.3850e-06   N=  4 
  0.50   1.122162986649   0.476315286524   5.0009e-06 
  0.75   1.265408064509   0.658662664483   7.1729e-06 
  1.00   1.444014056895   0.751268242944   2.9241e-06 

  0.25   1.031083515581   0.247306326743   5.3703e-09   N= 16 
  0.50   1.122158005046   0.476319854529   1.9320e-08 
  0.75   1.265400918802   0.658656442176   2.7201e-08 
  1.00   1.444016971240   0.751230496912   9.7675e-09 

  0.25   1.031083510231   0.247306340242   2.1021e-11   N= 64 
  0.50   1.122157985801   0.476319869139   7.5974e-11 
  0.75   1.265400891710   0.658656406501   1.0847e-10 
  1.00   1.444016980977   0.751230324680   3.0409e-11 

  0.25   1.031083510210   0.247306340295   8.1268e-14   N=256 
  0.50   1.122157985725   0.476319869194   2.9399e-13 
  0.75   1.265400891602   0.658656406353   4.2588e-13 
  1.00   1.444016981007   0.751230323984   1.1346e-13 
>> 
PLOT 

[Plot: y (solid) and dy/dx (dashed) versus x on [0, 1], computed with N = 256.] 
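
The computation in (c) and (d) can also be reproduced without the external routines RK4.m and fMAV_2.m. The following self-contained Matlab sketch carries out the classical Runge-Kutta step inline for N = 256 and compares the result at x = 1 with the 15-digit reference value quoted in (b). It is a simplified stand-in for MAV_2D, not the book's program; the names f, u, yref, and err are illustrative.

```matlab
% Sketch: classical RK4 for y'' = cos(x*y), y(0) = 1, y'(0) = 0 on [0,1].
f = @(x,u) [u(2); cos(x*u(1))];     % u = [y; dy/dx]
N = 256;  h = 1/N;
u = [1; 0];                          % initial values
for n = 0:N-1
    x  = n*h;
    k1 = f(x,       u);
    k2 = f(x + h/2, u + h/2*k1);
    k3 = f(x + h/2, u + h/2*k2);
    k4 = f(x + h,   u + h*k3);
    u  = u + h/6*(k1 + 2*k2 + 2*k3 + k4);
end
yref = 1.44401698100709;             % value of y(1) from the Maple series in (b)
err  = abs(u(1) - yref);
fprintf('y(1) ~ %.12f, dy/dx(1) ~ %.12f, error = %.2e\n', u(1), u(2), err);
```

Repeating the run with N = 64 and N = 16 should show the error growing by a factor of roughly 4^4 = 256 each time, in line with the fourth-order convergence visible in the OUTPUT table.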


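5. (a) First differentiate 
\[ u(t) - \varepsilon \sin u(t) - t = 0 \]
to obtain 
\[ u'(t) = \frac{1}{1 - \varepsilon \cos u}. \]
Letting $x(t) = \cos u - \varepsilon$, $y(t) = \sqrt{1 - \varepsilon^2} \sin u$, we then have 
\[ \begin{aligned}
x'(t) &= \frac{-\sin u}{1 - \varepsilon \cos u}, & x''(t) &= \frac{-\cos u + \varepsilon}{(1 - \varepsilon \cos u)^3} = -\frac{x(t)}{(1 - \varepsilon \cos u)^3}, \\
y'(t) &= \frac{\sqrt{1 - \varepsilon^2} \cos u}{1 - \varepsilon \cos u}, & y''(t) &= -\frac{\sqrt{1 - \varepsilon^2} \sin u}{(1 - \varepsilon \cos u)^3} = -\frac{y(t)}{(1 - \varepsilon \cos u)^3}.
\end{aligned} \]
Since, on the other hand, 
\[ x^2(t) + y^2(t) = (\cos u - \varepsilon)^2 + (1 - \varepsilon^2) \sin^2 u = (1 - \varepsilon \cos u)^2, \]
we have $r = 1 - \varepsilon \cos u$, and thus 
\[ x'' = -\frac{x}{r^3}, \qquad y'' = -\frac{y}{r^3}, \]
showing that the differential equations are satisfied. So are the initial 
conditions, since u(0) = 0, and thus 
\[ x(0) = 1 - \varepsilon, \quad x'(0) = 0; \qquad y(0) = 0, \quad y'(0) = \frac{\sqrt{1 - \varepsilon^2}}{1 - \varepsilon} = \sqrt{\frac{1 + \varepsilon}{1 - \varepsilon}}. \]
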

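(b) Let $u = x$, $v = x'$, $w = y$, $z = y'$. Then the first-order system is
$$\frac{du}{dt} = v, \qquad u(0) = 1 - \varepsilon,$$
$$\frac{dv}{dt} = -\frac{u}{r^3}, \qquad v(0) = 0,$$
$$\frac{dw}{dt} = z, \qquad w(0) = 0,$$
$$\frac{dz}{dt} = -\frac{w}{r^3}, \qquad z(0) = \sqrt{\frac{1+\varepsilon}{1-\varepsilon}},$$
where $r^2 = u^2 + w^2$.

(c)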
PROGRAM
%MAV_5C
e=.3;
%e=.5;
%e=.7;
eps0=.5e-12;
for N=[40 80 120]
  h=1/N; u1=[1-e;0;0;sqrt((1+e)/(1-e))]; v1=0;
  r=zeros(20*N,1); s=zeros(20*N,1);
  for n=0:20*N-1
    u=u1; v=v1;
    t=n*h; t1=t+h;
    u1=RK4(@fMAV_5C,t,u,h);          % one Runge-Kutta step
    v1=kepler(t1,v,e,eps0);          % solve Kepler's equation at t1
    y=[cos(v1)-e;sqrt(1-e^2)*sin(v1)];   % exact x(t1), y(t1)
    pr=h^(-4)*(u1(1:2)-y);           % error scaled by h^(-4)
    x(n+1)=t1; r(n+1)=pr(1); s(n+1)=pr(2);
  end
  if N==120
    hold on
    plot(x,r);
    plot(x,s,'--');
    axis([0 20 -5 5]) % for e=.3
    % axis([0 20 -60 80]) % for e=.5
    % axis([0 20 -4000 7000]) % for e=.7
    xlabel('t')
    ylabel('r, s')
    text(9.5,2,'r(t)') % for e=.3
    % text(9.5,20,'r(t)') % for e=.5
    % text(10.5,1500,'r(t)') % for e=.7
    text(10.75,-2,'s(t)') % for e=.3
    % text(8.2,-25,'s(t)') % for e=.5
    % text(9,-1000,'s(t)') % for e=.7
    hold off
  end
end
Chapter 6
Initial Value Problems for ODEs: 
Multistep Methods 


We saw in Chap.5 that (explicit) one-step methods are increasingly difficult to 
construct as one upgrades the order requirement. This is no longer true for multistep 
methods, where an increase in order is straightforward but comes with a price: a 
potential danger of instability. In addition, there are other complications such as the 
need for an initialization procedure and considerably more complicated procedures 
for changing the grid length. Yet, in terms of work involved, multistep methods are 
still among the most attractive methods. We discuss them along lines similar to one- 
step methods, beginning with a local description and examples and proceeding to 
the global description and problems of stiffness. By the very nature of multistep 
methods, the discussion of stability is now more extensive. 


6.1 Local Description of Multistep Methods 


6.1.1 Explicit and Implicit Methods 


We consider as before the initial value problem for a first-order system of differential
equations

$$\frac{dy}{dx} = f(x, y), \quad a \le x \le b; \qquad y(a) = y_0 \tag{6.1}$$

(cf. Chap. 5, (5.14)–(5.16)). Our task is again to determine a vector-valued grid
function $u \in \Gamma_h[a,b]$ (cf. Chap. 5, Sect. 5.7) such that $u_n \approx y(x_n)$ at the nth grid
point $x_n$.

A k-step method ($k > 1$) obtains $u_{n+k}$ in terms of k preceding approximations
$u_{n+k-1}, u_{n+k-2}, \ldots, u_n$. We call k the step number (or index) of the method.
We consider only linear k-step methods, which, in their most general form, but
assuming a constant grid length h, can be written as

$$u_{n+k} + \alpha_{k-1}u_{n+k-1} + \cdots + \alpha_0 u_n = h\,[\beta_k f_{n+k} + \beta_{k-1} f_{n+k-1} + \cdots + \beta_0 f_n], \quad n = 0, 1, 2, \ldots, N-k, \tag{6.2}$$

where

$$x_r = a + rh, \quad f_r = f(x_r, u_r), \quad r = 0, 1, \ldots, N, \tag{6.3}$$

and the $\alpha_s$ and $\beta_s$ are given (scalar) coefficients. The relation (6.2) is linear in the
function values $f_r$ (in contrast to Runge–Kutta methods); nevertheless, we are still
dealing with a nonlinear difference equation for the grid function u.

The definition (6.2) must be supplemented by a starting procedure for obtaining
the approximations to $y(x_s)$,

$$u_s = u_s(h), \quad s = 0, 1, \ldots, k-1. \tag{6.4}$$

These normally depend on the grid length h, and so may the coefficients $\alpha_s$, $\beta_s$ in
(6.2). The method (6.2) is called explicit if $\beta_k = 0$ and implicit otherwise.
Implicit methods require the solution of a system of nonlinear equations,

$$u_{n+k} = h\beta_k f(x_{n+k}, u_{n+k}) + g_n, \tag{6.5}$$

where

$$g_n = h\sum_{s=0}^{k-1} \beta_s f_{n+s} - \sum_{s=0}^{k-1} \alpha_s u_{n+s} \tag{6.6}$$

is a known vector. Fortunately, the nonlinearity in (6.5) is rather weak and in fact
disappears in the limit as $h \to 0$. This suggests the use of successive iteration on
(6.5),

$$u_{n+k}^{[\nu]} = h\beta_k f(x_{n+k}, u_{n+k}^{[\nu-1]}) + g_n, \quad \nu = 1, 2, \ldots, \tag{6.7}$$

where $u_{n+k}^{[0]}$ is a suitable initial approximation for $u_{n+k}$. By a simple application
of the contraction mapping principle (cf. Chap. 4, Sect. 4.9.1), one shows that (6.7)
indeed converges as $\nu \to \infty$, for arbitrary initial approximation, provided h is small
enough.

Theorem 6.1.1. Suppose f satisfies a uniform Lipschitz condition on $[a,b] \times \mathbb{R}^d$
(cf. Chap. 5, Sect. 5.3),

$$\|f(x,y) - f(x,y^*)\| \le L\,\|y - y^*\|, \quad x \in [a,b], \quad y, y^* \in \mathbb{R}^d, \tag{6.8}$$

and assume that

$$\lambda := h|\beta_k|L < 1. \tag{6.9}$$

Then (6.5) has a unique solution $u_{n+k}$. Moreover, for arbitrary $u_{n+k}^{[0]}$,

$$u_{n+k} = \lim_{\nu\to\infty} u_{n+k}^{[\nu]}, \tag{6.10}$$

and

$$\|u_{n+k}^{[\nu]} - u_{n+k}\| \le \frac{\lambda^\nu}{1-\lambda}\,\|u_{n+k}^{[1]} - u_{n+k}^{[0]}\|, \quad \nu = 1, 2, 3, \ldots. \tag{6.11}$$

Proof. We define the map $\varphi: \mathbb{R}^d \to \mathbb{R}^d$ by

$$\varphi(y) := h\beta_k f(x_{n+k}, y) + g_n. \tag{6.12}$$

Then, for any $y, y^* \in \mathbb{R}^d$, we have

$$\|\varphi(y) - \varphi(y^*)\| = h|\beta_k|\,\|f(x_{n+k}, y) - f(x_{n+k}, y^*)\| \le h|\beta_k|L\,\|y - y^*\|,$$

showing, in view of (6.9), that $\varphi$ is a contraction operator on $\mathbb{R}^d$. By the contraction
mapping principle, there is a unique fixed point of $\varphi$, that is, a vector $y = u_{n+k}$
satisfying $\varphi(y) = y$. This proves the first part of the theorem. The second part is
also a consequence of the contraction mapping principle if one notes that (6.7) is
just the fixed point iteration $u_{n+k}^{[\nu]} = \varphi(u_{n+k}^{[\nu-1]})$. □


Strictly speaking, an implicit multistep method requires “iteration to convergence”
in (6.7), that is, iteration until the required fixed point is obtained to machine
accuracy. This may well entail too many iterations to make the method competitive,
since each iteration step costs one evaluation of f. In practice, one often terminates
the iteration after the first or second step, having selected the starting value $u_{n+k}^{[0]}$
judiciously (cf. Sect. 6.2.3). It should also be noted that stiff systems, for which L
may be quite large, would require unrealistically small steps h to satisfy (6.9). In
such cases, Newton’s method (see Ex. 1), rather than fixed point iteration, would be
preferable.
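
The successive iteration (6.7) can be sketched in a few lines. The following is a minimal illustration, not taken from the text: it assumes a scalar test equation $y' = -5y + \cos x$ (so $L = 5$), the trapezoidal rule as the implicit method ($k = 1$, $\beta_1 = \beta_0 = 1/2$, $\alpha_0 = -1$), and an Euler step as the starting value; the function name `implicit_step_fixed_point` is hypothetical.

```python
import math


def implicit_step_fixed_point(f, x_next, g_n, h, beta_k, u0, tol=1e-14, max_iter=100):
    """Iterate u <- h*beta_k*f(x_next, u) + g_n, cf. (6.7), starting from u0.

    Returns the last iterate and the number of iterations performed.
    """
    u = u0
    for nu in range(1, max_iter + 1):
        u_new = h * beta_k * f(x_next, u) + g_n
        if abs(u_new - u) <= tol:
            return u_new, nu
        u = u_new
    return u, max_iter


def f(x, y):
    # Assumed test problem; Lipschitz constant L = 5 with respect to y.
    return -5.0 * y + math.cos(x)


L = 5.0
h = 0.1
beta_k = 0.5                      # trapezoidal rule
lam = h * abs(beta_k) * L         # contraction constant of (6.9), here 0.25

x_n, u_n = 0.0, 1.0
g_n = u_n + h * 0.5 * f(x_n, u_n)     # known vector (6.6) for k = 1
u_start = u_n + h * f(x_n, u_n)       # Euler step as initial approximation
u_fixed, iters = implicit_step_fixed_point(f, x_n + h, g_n, h, beta_k, u_start)
```

Since `lam` is well below 1 here, the iteration converges quickly; for a stiff problem with large L the same step size would violate (6.9), which is the situation in which Newton's method becomes preferable.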


6.1.2 Local Accuracy 
In analogy with one-step methods (cf. Chap. 5, (5.77) and (5.78)), we define residual
operators by

$$(Rv)(x) := v'(x) - f(x, v(x)), \quad v \in C^1[a,b], \tag{6.13}$$

$$(R_h v)_n := \frac{1}{h}\sum_{s=0}^{k} \alpha_s v_{n+s} - \sum_{s=0}^{k} \beta_s f(x_{n+s}, v_{n+s}), \quad v \in \Gamma_h[a,b], \quad n = 0, 1, \ldots, N-k. \tag{6.14}$$

(We may arbitrarily define $(R_h v)_N = \cdots = (R_h v)_{N-k+1} = (R_h v)_{N-k}$ to obtain a
grid function $R_h v$ defined on the entire grid.) In (6.14) and throughout this chapter,
we adopt the convention $\alpha_k = 1$. Since there is no longer a natural “generic” point
(x, y) in which to define our method, we take the analogue of Chap. 5, (5.81) (except
for sign) as our definition of truncation error:

$$(T_h)_n = (R_h y)_n, \quad n = 0, 1, 2, \ldots, N, \tag{6.15}$$

where y(x) is the exact solution of (6.1). This defines a grid function $T_h$ on a
uniform grid on [a,b]. We define consistency, order, and principal error function
as before in Chap. 5, that is, the method (6.2) is consistent if

$$\|T_h\|_\infty \to 0 \quad \text{as } h \to 0, \tag{6.16}$$

has order p if

$$\|T_h\|_\infty = O(h^p) \quad \text{as } h \to 0, \tag{6.17}$$

and admits a principal error function $\tau \in C[a,b]$ if

$$\tau(x) \not\equiv 0 \quad \text{and} \quad (T_h)_n = \tau(x_n)h^p + O(h^{p+1}) \quad \text{as } h \to 0 \tag{6.18}$$

in the usual sense that $\|T_h - h^p\tau\|_\infty = O(h^{p+1})$. The infinity norm of grid functions
in (6.16)–(6.18) is as defined in Chap. 5, (5.75).
Note that (6.15) can be written in the simple form

$$\sum_{s=0}^{k} \alpha_s y(x_{n+s}) - h\sum_{s=0}^{k} \beta_s y'(x_{n+s}) = h(T_h)_n, \tag{6.19}$$

since $f(x_{n+s}, y(x_{n+s})) = y'(x_{n+s})$ by virtue of the differential equation (6.1).
Although (6.19) is a relation for vector-valued functions, the relationship is exactly
the same for each component. This suggests defining a linear operator
$L_h: C^1[\mathbb{R}] \to C[\mathbb{R}]$ on scalar functions by letting

$$(L_h z)(x) := \sum_{s=0}^{k} [\alpha_s z(x + sh) - h\beta_s z'(x + sh)], \quad z \in C^1[\mathbb{R}]. \tag{6.20}$$

If $L_h z$ were identically 0 for all $z \in C^1[\mathbb{R}]$, then so would be the truncation
error, and our method would produce exact answers if started with exact initial
values. This is unrealistic, however; nevertheless, we would like $L_h$ to annihilate as
many functions as possible. This motivates the following concept of degree. Given
a set of linearly independent “gauge functions” $\{\omega_r(x)\}_{r=0}^{\infty}$ (usually complete on
compact intervals), we say that the method (6.2) has Ω-degree p if its associated
linear operator $L_h$ satisfies

$$L_h\omega \equiv 0 \quad \text{for all } \omega \in \Omega_p, \text{ all } h > 0. \tag{6.21}$$

Here, $\Omega_p$ is the set of functions spanned by the first p + 1 gauge functions
$\omega_0, \omega_1, \ldots, \omega_p$, hence

$$\Omega_0 \subset \Omega_1 \subset \Omega_2 \subset \cdots, \quad \dim \Omega_m = m + 1. \tag{6.22}$$

We say that $\Omega_m$ is closed under translation if $\omega(x) \in \Omega_m$ implies $\omega(x + c) \in \Omega_m$
for arbitrary real c. Similarly, $\Omega_m$ is said to be closed under scaling if $\omega(x) \in
\Omega_m$ implies $\omega(cx) \in \Omega_m$ for arbitrary real c. For example, algebraic polynomials
$\Omega_m = \mathbb{P}_m$ are closed under translation as well as scaling, trigonometric polynomials
$\Omega_m = \mathbb{T}_m[0, 2\pi]$ and exponential sums $\Omega_m = \mathbb{E}_m$ (cf. Chap. 2, Examples to (2.2))
only under translation, and spline functions $\mathbb{S}_m^k(\Delta)$ (for fixed partition) neither under
scaling nor under translation.
The following theorem is no more than a simple observation. 


Theorem 6.1.2. (a) If $\Omega_p$ is closed under translation, then the method (6.2) has
Ω-degree p if and only if

$$(L_h\omega)(0) = 0 \quad \text{for all } \omega \in \Omega_p, \text{ all } h > 0. \tag{6.23}$$

(b) If $\Omega_p$ is closed under translation and scaling, then the method (6.2) has
Ω-degree p if and only if

$$(L_1\omega)(0) = 0 \quad \text{for all } \omega \in \Omega_p, \tag{6.24}$$

where $L_1 = L_h$ for h = 1.

Proof. (a) The necessity of (6.23) is trivial. To prove its sufficiency, it is enough to
show that for any $x_0 \in \mathbb{R}$,

$$(L_h\omega)(x_0) = 0, \quad \text{all } \omega \in \Omega_p, \text{ all } h > 0.$$

Take any $\omega \in \Omega_p$ and define $\omega_0(x) = \omega(x + x_0)$. Then, by assumption, $\omega_0 \in \Omega_p$;
hence, for all h > 0,

$$0 = (L_h\omega_0)(0) = \sum_{s=0}^{k} [\alpha_s\omega(x_0 + sh) - h\beta_s\omega'(x_0 + sh)] = (L_h\omega)(x_0).$$

(b) The necessity of (6.24) is trivial. For the sufficiency, let $\omega_0(x) = \omega(x_0 + xh)$
for any given $\omega \in \Omega_p$. Then, by assumption, $\omega_0 \in \Omega_p$ and

$$0 = (L_1\omega_0)(0) = \sum_{s=0}^{k} [\alpha_s\omega_0(s) - \beta_s\omega_0'(s)] = \sum_{s=0}^{k} [\alpha_s\omega(x_0 + sh) - h\beta_s\omega'(x_0 + sh)] = (L_h\omega)(x_0).$$

Since $x_0 \in \mathbb{R}$ and h > 0 are arbitrary, the assertion follows. □
Case (b) of Theorem 6.1.2 suggests the introduction of the linear functional
$L: C^1[\mathbb{R}] \to \mathbb{R}$ associated with the method (6.2),

$$Lu := \sum_{s=0}^{k} [\alpha_s u(s) - \beta_s u'(s)], \quad u \in C^1[\mathbb{R}]. \tag{6.25}$$

For $\Omega_m = \mathbb{P}_m$ we refer to Ω-degree as the algebraic (or polynomial) degree. Thus,
(6.2) has algebraic degree p if Lu = 0 for all $u \in \mathbb{P}_p$. By linearity, this is equivalent
to

$$Lt^r = 0, \quad r = 0, 1, \ldots, p. \tag{6.26}$$

Example. Determine all explicit two-step methods

$$u_{n+2} + \alpha_1 u_{n+1} + \alpha_0 u_n = h(\beta_1 f_{n+1} + \beta_0 f_n) \tag{6.27}$$

having polynomial degree p = 0, 1, 2, 3.
Here

$$Lu = u(2) + \alpha_1 u(1) + \alpha_0 u(0) - \beta_1 u'(1) - \beta_0 u'(0).$$

The first four equations in (6.26) are

$$\begin{aligned}
1 + \alpha_1 + \alpha_0 &= 0,\\
2 + \alpha_1 - \beta_1 - \beta_0 &= 0,\\
4 + \alpha_1 - 2\beta_1 &= 0,\\
8 + \alpha_1 - 3\beta_1 &= 0.
\end{aligned} \tag{6.28}$$

We have algebraic degree 0, 1, 2, 3 if, respectively, the first, the first two, the first
three, and all four equations in (6.28) are satisfied. Thus,

$$\begin{aligned}
&\alpha_1 = -\alpha_0 - 1, \quad \beta_0, \beta_1 \text{ arbitrary} && (p = 0),\\
&\alpha_1 = -\alpha_0 - 1, \quad \beta_1 = -\alpha_0 - \beta_0 + 1 && (p = 1),\\
&\alpha_1 = -\alpha_0 - 1, \quad \beta_1 = -\tfrac12\alpha_0 + \tfrac32, \quad \beta_0 = -\tfrac12\alpha_0 - \tfrac12 && (p = 2),\\
&\alpha_1 = 4, \quad \alpha_0 = -5, \quad \beta_1 = 4, \quad \beta_0 = 2 && (p = 3)
\end{aligned} \tag{6.29}$$

yield (3 − p)-parameter families of methods of degree p. Since $Lt^4 = 16 + \alpha_1 -
4\beta_1 = 4 \ne 0$, degree p = 4 is impossible. This means that the last method in (6.29),

$$u_{n+2} + 4u_{n+1} - 5u_n = 2h(2f_{n+1} + f_n), \tag{6.30}$$

is optimal as far as algebraic degree is concerned. Other special cases include the
midpoint rule

$$u_{n+2} = u_n + 2hf_{n+1} \quad (\alpha_0 = -1;\ p = 2)$$

and the Adams–Bashforth second-order method (cf. Sect. 6.2.1),

$$u_{n+2} = u_{n+1} + \tfrac12 h(3f_{n+1} - f_n) \quad (\alpha_0 = 0;\ p = 2).$$

The “optimal” method (6.30) is nameless, and for good reason. Suppose, indeed,
that we apply it to the trivial (scalar) initial value problem

$$y' = 0, \quad y(0) = 0 \quad \text{on } 0 \le x \le 1,$$

which has the exact solution $y(x) \equiv 0$. Assume $u_0 = 0$ but $u_1 = \varepsilon$ (to account for
a small rounding error in the second starting value). Even if (6.30) is applied with
infinite precision, we then get

$$u_n = \tfrac16\,\varepsilon\,[1 + (-1)^{n+1}5^n], \quad n = 0, 1, \ldots, N.$$

Assuming further that $\varepsilon = h^{p+1}$ (pth-order one-step method), we will have, at the
end of the interval [0,1],

$$u_N = \tfrac16\,h^{p+1}[1 + (-1)^{N+1}5^N] \sim \tfrac16(-1)^{N+1}N^{-p-1}5^N \quad \text{as } N \to \infty.$$

Thus, $|u_N| \to \infty$ exponentially fast and highly oscillating on top of it. We have here
an example of “strong instability”; this is analyzed later in more detail (cf. Sect. 6.3).
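
The degree conditions (6.26) and the blow-up formula above are easy to check in exact rational arithmetic. The sketch below is only an illustration; the helper names `functional_on_monomial`, `algebraic_degree`, and `run_optimal_method` are hypothetical, and coefficient lists are assumed to be ordered as $[\alpha_0, \alpha_1, \alpha_2]$ and $[\beta_0, \beta_1, \beta_2]$.

```python
from fractions import Fraction


def functional_on_monomial(alpha, beta, r):
    """Evaluate L t^r = sum_s [alpha_s s^r - beta_s r s^(r-1)], cf. (6.25)."""
    total = Fraction(0)
    for s, (a, b) in enumerate(zip(alpha, beta)):
        total += Fraction(a) * Fraction(s) ** r
        if r >= 1:
            total -= Fraction(b) * r * Fraction(s) ** (r - 1)
    return total


def algebraic_degree(alpha, beta, max_r=10):
    """Largest p with L t^r = 0 for r = 0..p (returns -1 if L 1 != 0)."""
    p = -1
    for r in range(max_r + 1):
        if functional_on_monomial(alpha, beta, r) != 0:
            break
        p = r
    return p


def run_optimal_method(eps, N):
    """Apply (6.30) to y' = 0 with u_0 = 0, u_1 = eps (f = 0, so h drops out)."""
    u = [Fraction(0), Fraction(eps)]
    for n in range(N - 1):
        u.append(-4 * u[n + 1] + 5 * u[n])
    return u


optimal = ([-5, 4, 1], [2, 4, 0])                               # method (6.30)
midpoint = ([-1, 0, 1], [0, 2, 0])
adams_bashforth_2 = ([0, -1, 1], [Fraction(-1, 2), Fraction(3, 2), 0])

degrees = [algebraic_degree(*m) for m in (optimal, midpoint, adams_bashforth_2)]
growth = run_optimal_method(Fraction(1, 10**6), 20)
```

With a perturbation of $10^{-6}$ in $u_1$, the entries of `growth` follow $\tfrac16\varepsilon[1 - (-5)^n]$ exactly, so the highest-degree method is useless in practice even though its local accuracy is the best of the three.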


6.1.3 Polynomial Degree vs. Order 


We recall from (6.19) and (6.20) that for the truncation error $T_h$ of (6.2), we have

$$h(T_h)_n = \sum_{s=0}^{k} [\alpha_s y(x_n + sh) - h\beta_s y'(x_n + sh)] = (L_h y)(x_n), \quad n = 0, 1, 2, \ldots, N-k. \tag{6.31}$$

Let

$$u(t) := y(x_n + th), \quad 0 \le t \le k. \tag{6.32}$$

(More precisely, we should write $u_{n,h}(t)$.) Then

$$h(T_h)_n = \sum_{s=0}^{k} [\alpha_s u(s) - \beta_s u'(s)] = Lu, \tag{6.33}$$

where L on the right is to be applied componentwise to each component of the
vector u. If $T_h$ is the truncation error of a method of algebraic degree p, then the
linear functional L in (6.33) annihilates all polynomials of degree p and thus, if
$u \in C^{p+1}[0, k]$, can be represented in terms of the Peano kernel

$$\lambda_p(\sigma) = L_{(t)}(t - \sigma)_+^p \tag{6.34}$$

by

$$Lu = \frac{1}{p!}\int_0^k \lambda_p(\sigma)u^{(p+1)}(\sigma)\,d\sigma \tag{6.35}$$

(cf. Chap. 3, Sect. 3.2.6). From the explicit formula

$$\lambda_p(\sigma) = \sum_{s=0}^{k} [\alpha_s(s - \sigma)_+^p - \beta_s p(s - \sigma)_+^{p-1}], \quad p \ge 1, \tag{6.36}$$

it is easily seen that $\lambda_p \in \mathbb{S}_p^{p-2}(\Delta)$, where Δ is the subdivision of [0, k] into
subintervals of length 1. Outside the interval [0, k] we have $\lambda_p \equiv 0$. Moreover,
if L is definite of order p, then, and only then (cf. Chap. 3, (3.80) and Ex. 51),

$$Lu = \ell_{p+1}u^{(p+1)}(\bar t), \quad 0 < \bar t < k; \qquad \ell_{p+1} = L\,\frac{t^{p+1}}{(p+1)!}. \tag{6.37}$$

The Peano representation (6.35) of L in combination with (6.33) allows us to
identify the polynomial degree of a multistep method with its order as defined in
Sect. 6.1.2.


Theorem 6.1.3. A multistep method (6.2) of polynomial degree p has order p
whenever the exact solution y(x) of (6.1) is in the smoothness class $C^{p+1}[a, b]$.
If the associated functional L is definite, then

$$(T_h)_n = \ell_{p+1}y^{(p+1)}(\bar x_n)h^p, \quad x_n < \bar x_n < x_{n+k}, \tag{6.38}$$

where $\ell_{p+1}$ is as given in (6.37). Moreover, for the principal error function τ of the
method, whether definite or not, we have, if $y \in C^{p+2}[a, b]$,

$$\tau(x) = \ell_{p+1}y^{(p+1)}(x). \tag{6.39}$$

Proof. By (6.32) and (6.33), we have

$$h(T_h)_n = Lu, \quad u(t) = y(x_n + th), \quad n = 0, 1, 2, \ldots, N-k;$$

hence, by (6.35),

$$h(T_h)_n = \frac{1}{p!}\int_0^k \lambda_p(\sigma)u^{(p+1)}(\sigma)\,d\sigma = \frac{h^{p+1}}{p!}\int_0^k \lambda_p(\sigma)y^{(p+1)}(x_n + \sigma h)\,d\sigma. \tag{6.40}$$

Therefore,

$$\|(T_h)_n\| = \frac{h^p}{p!}\left\|\int_0^k \lambda_p(\sigma)y^{(p+1)}(x_n + \sigma h)\,d\sigma\right\| \le \frac{h^p}{p!}\int_0^k |\lambda_p(\sigma)|\,\|y^{(p+1)}(x_n + \sigma h)\|\,d\sigma \le \frac{h^p}{p!}\,\|y^{(p+1)}\|_\infty\int_0^k |\lambda_p(\sigma)|\,d\sigma,$$

and we see that

$$\|T_h\|_\infty \le Ch^p, \quad C = \frac{1}{p!}\,\|y^{(p+1)}\|_\infty\int_0^k |\lambda_p(\sigma)|\,d\sigma. \tag{6.41}$$

This proves the first part of the theorem.

If L is definite, then (6.38) follows directly from (6.33) and from (6.37) applied to
the vector-valued function u in (6.32). Finally, (6.39) follows from (6.40) by noting
that $y^{(p+1)}(x_n + \sigma h) = y^{(p+1)}(x_n) + O(h)$ and the fact that $\frac{1}{p!}\int_0^k \lambda_p(\sigma)\,d\sigma = \ell_{p+1}$
(cf. Chap. 3, (3.80)). □


The proof of Theorem 6.1.3 also exhibits, in (6.41), an explicit bound on the local
truncation error. For methods with definite functional L, there is the even simpler
bound (6.41) with

$$C = |\ell_{p+1}|\,\|y^{(p+1)}\|_\infty \quad (L \text{ definite}), \tag{6.42}$$

which follows from (6.38). It is seen later in Sect. 6.3.4 that for the global
discretization error it is not $\ell_{p+1}$ which is relevant, but rather

$$C_{k,p} = \frac{\ell_{p+1}}{\sum_{s=0}^{k}\beta_s}, \tag{6.43}$$

which is called the error constant of the k-step method (6.2) (of order p). The
denominator in (6.43) is positive if the method is stable and consistent (cf. Sect. 6.4.3).
As a simple example, consider the midpoint rule

$$u_{n+2} = u_n + 2hf_{n+1},$$

for which

$$Lu = u(2) - u(0) - 2u'(1).$$

We already know that it has order p = 2. The Peano kernel is

$$\lambda_2(\sigma) = (2 - \sigma)_+^2 - 4(1 - \sigma)_+,$$

so that

$$\lambda_2(\sigma) = \begin{cases} (2 - \sigma)^2 - 4(1 - \sigma) = \sigma^2 & \text{if } 0 \le \sigma \le 1,\\ (2 - \sigma)^2 & \text{if } 1 \le \sigma \le 2. \end{cases}$$

It consists of two parabolic arcs as shown in Fig. 6.1. Evidently, L is positive
definite; hence by (6.38) and (6.37),

$$(T_h)_n = \ell_3 y'''(\bar x_n)h^2, \quad \ell_3 = \tfrac13.$$

Fig. 6.1 Peano kernel of the midpoint rule
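
As a numerical cross-check of the midpoint-rule example, one can evaluate the Peano kernel from (6.36) and confirm that $\frac{1}{p!}\int_0^k \lambda_p(\sigma)\,d\sigma$ reproduces $\ell_3 = 1/3$. The following is a sketch under stated assumptions: the function names are hypothetical, $p \ge 2$ is assumed so that the kernel is continuous, and Simpson's rule is applied on each unit subinterval (exact here because the kernel is piecewise quadratic).

```python
import math


def peano_kernel(alpha, beta, p, sigma):
    """lambda_p(sigma) from (6.36), using truncated powers (s - sigma)_+ (p >= 2)."""
    total = 0.0
    for s, (a, b) in enumerate(zip(alpha, beta)):
        d = s - sigma
        if d > 0:
            total += a * d ** p - b * p * d ** (p - 1)
    return total


def simpson_on_unit_intervals(g, k):
    """Integrate g over [0, k] with Simpson's rule on each interval [j, j+1]."""
    return sum((g(j) + 4.0 * g(j + 0.5) + g(j + 1.0)) / 6.0 for j in range(k))


# Midpoint rule u_{n+2} = u_n + 2 h f_{n+1}: alpha = [-1, 0, 1], beta = [0, 2, 0].
alpha, beta, p, k = [-1.0, 0.0, 1.0], [0.0, 2.0, 0.0], 2, 2


def kernel(sigma):
    return peano_kernel(alpha, beta, p, sigma)


ell_from_kernel = simpson_on_unit_intervals(kernel, k) / math.factorial(p)
ell_from_monomial = (2.0 ** 3 - 0.0 - 2.0 * 3.0 * 1.0 ** 2) / math.factorial(3)  # L t^3 / 3!
```

Both routes give the same constant, and sampling `kernel` on a fine grid shows it is nonnegative on [0, 2], which is what makes L definite.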


6.2 Examples of Multistep Methods 


An alternative way of deriving multistep formulae, which does not require the
solution of linear algebraic systems (as does the method of Sect. 6.1.2), starts from
the fundamental theorem of calculus,

$$y(x_{n+k}) = y(x_{n+k-1}) + \int_{x_{n+k-1}}^{x_{n+k}} y'(x)\,dx. \tag{6.44}$$

(Instead of $x_{n+k-1}$ on the right, we could take any other grid point $x_{n+\kappa}$,
$0 \le \kappa < k-1$, but we limit ourselves to the case shown in (6.44).) A multistep
formula results from (6.44) if the integral is expressed (approximately) by a linear
combination of derivative values at some grid points selected from the set $\{x_{n+\kappa}:
\kappa = 0, 1, \ldots, k\}$. Those selected may be called the “active” grid points. A simple
way to do this is to approximate y′ by the unique polynomial interpolating y′ at
the active grid points. If we carry along the remainder term, we will also get an
expression for the truncation error.

We implement this in the two most important cases where the active grid points
are $x_n, x_{n+1}, \ldots, x_{n+k-1}$ and $x_{n+1}, x_{n+2}, \ldots, x_{n+k}$, respectively, giving rise to the
family of Adams-type multistep methods.
6.2.1 Adams¹–Bashforth Method


Replace y′ in (6.44) by the interpolation polynomial $p_{k-1}(y'; x_n, x_{n+1}, \ldots, x_{n+k-1};
x)$ of degree ≤ k − 1 interpolating y′ at the grid points $x_n, x_{n+1}, \ldots, x_{n+k-1}$
(cf. Chap. 2, Sect. 2.2.1). If we include the remainder term (cf. Chap. 2, Sect. 2.2.2),
assuming that $y \in C^{k+1}$, and make the change of variable $x = x_{n+k-1} + th$ in the
integral of (6.44), we obtain

$$y(x_{n+k}) = y(x_{n+k-1}) + h\sum_{s=0}^{k-1}\beta_{k,s}\,y'(x_{n+s}) + hr_n, \tag{6.45}$$

where

$$\beta_{k,s} = \int_0^1 \prod_{\substack{r=0\\ r\ne s}}^{k-1}\left(\frac{t + k - 1 - r}{s - r}\right)dt, \quad s = 0, 1, \ldots, k-1, \tag{6.46}$$

and

$$r_n = \gamma_k h^k y^{(k+1)}(\bar x_n), \quad x_n < \bar x_n < x_{n+k}; \qquad \gamma_k = \int_0^1 \binom{t + k - 1}{k}dt. \tag{6.47}$$

The formula (6.45) suggests the multistep method

$$u_{n+k} = u_{n+k-1} + h\sum_{s=0}^{k-1}\beta_{k,s}\,f(x_{n+s}, u_{n+s}), \tag{6.48}$$

which is called the kth-order Adams–Bashforth method. It is called kth-order, since
comparison of (6.45) with (6.19) shows that in fact $r_n$ is the truncation error
of (6.48),

$$r_n = (T_h)_n, \tag{6.49}$$

and (6.47) shows that $T_h = O(h^k)$. In view of the form (6.47) of the truncation error,
we can infer, as mentioned in Sect. 6.1.3, that the linear functional L associated with
(6.48) is definite. We prove later in Sect. 6.4 (cf. Ex. 11(b)) that there is no stable
explicit k-step method that has order p > k. In this sense, (6.48) with $\beta_{k,s}$ given by
(6.46) is optimal.

If $y \in C^{k+2}$, it also follows from (6.47) and (6.49) that

$$(T_h)_n = \gamma_k h^k y^{(k+1)}(x_n) + O(h^{k+1}), \quad h \to 0,$$

that is, the Adams–Bashforth method (6.48) has the principal error function

$$\tau(x) = \gamma_k y^{(k+1)}(x). \tag{6.50}$$

The constant $\gamma_k$ (defined in (6.47)) is therefore the same as the constant $\ell_{k+1}$ defined
earlier in (6.37).

¹John Couch Adams (1819–1892), son of a tenant farmer, studied at Cambridge University where,
in 1859, he became professor of astronomy and geometry and, from 1861 on, director of the
observatory. His calculations, while still a student at Cambridge, predicted the existence of the
then unknown planet Neptune, based on irregularities in the orbit of the next inner planet Uranus.
Unfortunately, publication of his findings was delayed and it was Le Verrier, who did similar
calculations (and managed to publish them), and young astronomers in Berlin who succeeded
in locating the planet at the position predicted, who received credit for this historical discovery.
Understandably, this led to a prolonged dispute of priority. Adams is also known for his work
on lunar theory and magnetism. Later in his life, he turned to computational problems in number
theory. An ardent admirer of Newton, he took it upon himself to catalogue a large body of scientific
papers left behind by Newton after his death.

Had we used Newton’s form of the interpolation polynomial (cf. Chap. 2,
Sect. 2.2.6) and observed that the divided differences required for equally spaced
points are (cf. Chap. 2, Ex. 54)

$$[x_{n+k-1}, x_{n+k-2}, \ldots, x_{n+k-s-1}]f = \frac{1}{s!\,h^s}\nabla^s f_{n+k-1}, \tag{6.51}$$

where $\nabla f_{n+k-1} = f_{n+k-1} - f_{n+k-2}$, $\nabla^2 f_{n+k-1} = \nabla(\nabla f_{n+k-1}), \ldots$ are ordinary
(backward) differences, we would have obtained (6.48) in the form

$$u_{n+k} = u_{n+k-1} + h\sum_{s=0}^{k-1}\gamma_s\nabla^s f_{n+k-1}, \tag{6.52}$$

where

$$\gamma_s = \int_0^1 \binom{t + s - 1}{s}dt, \quad s = 0, 1, 2, \ldots. \tag{6.53}$$

The difference form (6.52) of the Adams–Bashforth method has important
practical advantages over the Lagrange form (6.48). For one thing, the coefficients in
(6.52) do not depend on the step number k. Adding more terms in the sum of (6.52)
thus increases the order (and step number) of the method. Related to this is the fact
that the first omitted term in the summation of (6.52) is a good approximation of the
truncation error. Indeed, by Chap. 2, (2.117), we know that

$$y^{(k+1)}(\bar x_n) \approx k!\,[x_{n+k-1}, x_{n+k-2}, \ldots, x_n, x_{n-1}]\,y';$$

hence, by (6.47), (6.49), and (6.51),

$$(T_h)_n = \gamma_k h^k y^{(k+1)}(\bar x_n) \approx \gamma_k h^k\,k!\,\frac{\nabla^k f_{n+k-1}}{k!\,h^k} = \gamma_k\nabla^k f_{n+k-1}.$$

When implementing the method (6.52), one needs to set up a table of (backward)
differences for each component of f. By adding an extra column of differences at
the end of the table (the kth differences), we are thus able to monitor the local
truncation errors by simply multiplying these differences by $\gamma_k$. No such easy
procedure is available for the Lagrange form of the method.

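It is, therefore, of some importance to have an effective method for calculating the
coefficients $\gamma_s$, s = 0, 1, 2, .... Such a method can be derived from the generating
function $\gamma(z) = \sum_{s=0}^{\infty}\gamma_s z^s$ of the coefficients. We have

$$\begin{aligned}
\gamma(z) &= \sum_{s=0}^{\infty} z^s\int_0^1 \binom{t + s - 1}{s}dt = \sum_{s=0}^{\infty}(-z)^s\int_0^1 \binom{-t}{s}dt\\
&= \int_0^1 \sum_{s=0}^{\infty}\binom{-t}{s}(-z)^s\,dt = \int_0^1 (1 - z)^{-t}\,dt\\
&= \int_0^1 e^{-t\ln(1-z)}\,dt = -\left.\frac{e^{-t\ln(1-z)}}{\ln(1-z)}\right|_{t=0}^{t=1}\\
&= -\frac{z}{(1 - z)\ln(1 - z)}.
\end{aligned}$$

Thus, the $\gamma_s$ are the coefficients in the Maclaurin expansion of

$$\gamma(z) = -\frac{z}{(1 - z)\ln(1 - z)}. \tag{6.54}$$

In particular,

$$\frac{z}{1 - z} = -\ln(1 - z)\sum_{s=0}^{\infty}\gamma_s z^s,$$

that is,

$$z + z^2 + z^3 + \cdots = \left(z + \tfrac12 z^2 + \tfrac13 z^3 + \cdots\right)\left(\gamma_0 + \gamma_1 z + \gamma_2 z^2 + \cdots\right),$$

which, on comparing coefficients of like powers on the left and right, yields

$$\gamma_0 = 1, \qquad \gamma_s = 1 - \tfrac12\gamma_{s-1} - \tfrac13\gamma_{s-2} - \cdots - \tfrac{1}{s+1}\gamma_0, \quad s = 1, 2, 3, \ldots. \tag{6.55}$$

It is, therefore, easy to compute as many of the coefficients $\gamma_s$ as desired, for
example,

$$\gamma_0 = 1, \quad \gamma_1 = \tfrac12, \quad \gamma_2 = \tfrac{5}{12}, \quad \gamma_3 = \tfrac38, \ldots.$$

Note that they are all positive, as they must be in view of (6.53).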
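
The recurrence (6.55) translates directly into code. The sketch below uses exact rational arithmetic; the function names are hypothetical. The second helper is an addition not stated in the text: it converts the difference form (6.52) into the Lagrange coefficients $\beta_{k,s}$ of (6.48) by expanding $\nabla^i f_{n+k-1} = \sum_{j=0}^{i}(-1)^j\binom{i}{j}f_{n+k-1-j}$.

```python
from fractions import Fraction
from math import comb


def adams_bashforth_gammas(m):
    """Return [gamma_0, ..., gamma_m] from the recurrence (6.55)."""
    gammas = [Fraction(1)]
    for s in range(1, m + 1):
        g = Fraction(1) - sum(gammas[s - j] / (j + 1) for j in range(1, s + 1))
        gammas.append(g)
    return gammas


def adams_bashforth_betas(k):
    """Lagrange coefficients of the k-step method; betas[s] multiplies f_{n+s}.

    Obtained by expanding the backward differences in (6.52).
    """
    gam = adams_bashforth_gammas(k - 1)
    betas = [Fraction(0)] * k
    for j in range(k):
        betas[k - 1 - j] = (-1) ** j * sum(comb(i, j) * gam[i] for i in range(j, k))
    return betas


first_gammas = adams_bashforth_gammas(4)
betas_k3 = adams_bashforth_betas(3)
```

For k = 2 the conversion gives the coefficients 3/2 and −1/2 of the second-order Adams–Bashforth formula quoted in Sect. 6.1.2, which serves as a quick sanity check on the index bookkeeping.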
6.2.2 Adams–Moulton² Method


This is the implicit analogue of the Adams–Bashforth method, that is, the point
$x_{n+k}$ is included among the active grid points. To obtain again a method of
order k, we need, as before, k active points and therefore select them to be
$x_{n+k}, x_{n+k-1}, \ldots, x_{n+1}$. The derivation of the method is then entirely analogous
to the one in Sect. 6.2.1, and we limit ourselves to simply stating the results.

The Lagrange form of the method is now

$$u_{n+k} = u_{n+k-1} + h\sum_{s=1}^{k}\beta_{k,s}^*\,f(x_{n+s}, u_{n+s}) \tag{6.56}$$

with

$$\beta_{k,s}^* = \int_0^1 \prod_{\substack{r=1\\ r\ne s}}^{k}\left(\frac{t + k - 1 - r}{s - r}\right)dt, \quad s = 1, 2, \ldots, k, \tag{6.57}$$

whereas Newton’s form becomes

$$u_{n+k} = u_{n+k-1} + h\sum_{s=0}^{k-1}\gamma_s^*\nabla^s f_{n+k} \tag{6.58}$$

with

$$\gamma_s^* = \int_{-1}^{0}\binom{t + s - 1}{s}dt, \quad s = 0, 1, 2, \ldots. \tag{6.59}$$

The truncation error and principal error function are

$$(T_h^*)_n = \gamma_k^* h^k y^{(k+1)}(\bar x_n), \quad x_{n+1} < \bar x_n < x_{n+k}; \qquad \tau^*(x) = \gamma_k^* y^{(k+1)}(x), \tag{6.60}$$

and the generating function for the $\gamma_s^*$ is

$$\gamma^*(z) = \sum_{s=0}^{\infty}\gamma_s^* z^s = -\frac{z}{\ln(1 - z)}. \tag{6.61}$$

From this, one finds as before,

$$\gamma_0^* = 1, \qquad \gamma_s^* = -\tfrac12\gamma_{s-1}^* - \tfrac13\gamma_{s-2}^* - \cdots - \tfrac{1}{s+1}\gamma_0^*, \quad s = 1, 2, 3, \ldots. \tag{6.62}$$

²Forest Ray Moulton (1872–1952) was professor of Astronomy at the University of Chicago and,
from 1927 to 1936, director of the Utilities Power and Light Corp. of Chicago. He used his method
during World War I and thereafter to integrate the equations of exterior ballistics. He also made
contributions to celestial mechanics.
So, for example,

$$\gamma_0^* = 1, \quad \gamma_1^* = -\tfrac12, \quad \gamma_2^* = -\tfrac{1}{12}, \quad \gamma_3^* = -\tfrac{1}{24}, \ldots.$$

It follows again from (6.60) that the truncation error is approximately the first
omitted term in the sum of (6.58),

$$(T_h^*)_n \approx \gamma_k^*\nabla^k f_{n+k}. \tag{6.63}$$

Since the formula (6.58) is implicit, however, it has to be solved by iteration.
A common procedure is to get a first approximation by assuming the last,
(k − 1)th, difference retained to be constant over the step from $x_{n+k-1}$ to $x_{n+k}$,
which allows one to generate the lower-order differences backward until one obtains
(an approximation to) $f_{n+k}$. We then compute $u_{n+k}$ from (6.58) and a new $f_{n+k}$
in terms of it. Then we revise all the differences required in (6.58) and reevaluate
$u_{n+k}$. The process can be repeated until it converges to the desired accuracy. In
effect, this is the fixed point iteration of Sect. 6.1.1, with a special choice of the
initial approximation.


6.2.3 Predictor—Corrector Methods 


These are pairs of an explicit and an implicit multistep method, usually of the same
order, where the explicit formula is used to predict the next approximation and the
implicit formula to correct it. Suppose we use an explicit k-step method of order k,
with coefficients $\alpha_s$, $\beta_s$, for the predictor, and an implicit (k − 1)-step method of
order k with coefficients $\alpha_s^*$, $\beta_s^*$, for the corrector. Assume further, for simplicity,
that both methods are definite (in the sense of (6.37)). Then, in Lagrange form, if
$\mathring u_{n+k}$ is the predicted approximation, one proceeds as follows:

$$\begin{aligned}
\mathring u_{n+k} &= -\sum_{s=0}^{k-1}\alpha_s u_{n+s} + h\sum_{s=0}^{k-1}\beta_s f_{n+s},\\
u_{n+k} &= -\sum_{s=1}^{k-1}\alpha_s^* u_{n+s} + h\Big\{\beta_k^* f(x_{n+k}, \mathring u_{n+k}) + \sum_{s=1}^{k-1}\beta_s^* f_{n+s}\Big\},\\
f_{n+k} &= f(x_{n+k}, u_{n+k}).
\end{aligned} \tag{6.64}$$

This requires exactly two evaluations of f per step and is often referred to as
a PECE method, where “P” stands for “predict,” “E” for “evaluate,” and “C”
for “correct.” One could of course correct once more, and then either quit, or
reevaluate f, and so on. Thus there are methods of type P(EC)², P(EC)²E, and the
like. Each additional reevaluation costs another function evaluation and, therefore,
the most economic methods are those of type PECE.

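Let us analyze the truncation error of the PECE method (6.64). It is natural to
define it by

$$(T_h^{PECE})_n = \frac{1}{h}\sum_{s=1}^{k}\alpha_s^* y(x_{n+s}) - \Big\{\beta_k^* f(x_{n+k}, \mathring y_{n+k}) + \sum_{s=1}^{k-1}\beta_s^* y'(x_{n+s})\Big\},$$

where $\alpha_k^* = 1$ and

$$\mathring y_{n+k} = -\sum_{s=0}^{k-1}\alpha_s y(x_{n+s}) + h\sum_{s=0}^{k-1}\beta_s y'(x_{n+s}),$$

that is, we apply (6.64) on exact values $u_{n+s} = y(x_{n+s})$, s = 0, 1, ..., k − 1. We
can write

$$\begin{aligned}
(T_h^{PECE})_n &= \frac{1}{h}\sum_{s=1}^{k}\alpha_s^* y(x_{n+s}) - \sum_{s=1}^{k}\beta_s^* y'(x_{n+s}) + \beta_k^*\,[y'(x_{n+k}) - f(x_{n+k}, \mathring y_{n+k})]\\
&= \ell_{k+1}^* h^k y^{(k+1)}(\bar x_n^*) + \beta_k^*\,[f(x_{n+k}, y(x_{n+k})) - f(x_{n+k}, \mathring y_{n+k})],
\end{aligned} \tag{6.65}$$

having used the truncation error $(T_h^*)_n$ (in the form (6.38)) of the corrector formula. Since

$$y(x_{n+k}) - \mathring y_{n+k} = \ell_{k+1}h^{k+1}y^{(k+1)}(\bar x_n) \tag{6.66}$$

is h times the truncation error of the predictor formula, the Lipschitz condition on
f yields

$$\|f(x_{n+k}, y(x_{n+k})) - f(x_{n+k}, \mathring y_{n+k})\| \le L\,|\ell_{k+1}|\,h^{k+1}\,\|y^{(k+1)}\|_\infty, \tag{6.67}$$

and hence, from (6.65), we obtain

$$\|T_h^{PECE}\|_\infty \le \left(|\ell_{k+1}^*| + hL\,|\ell_{k+1}|\,|\beta_k^*|\right)\|y^{(k+1)}\|_\infty h^k \le Ch^k,$$

where

$$C = \left(|\ell_{k+1}^*| + (b - a)L\,|\ell_{k+1}|\,|\beta_k^*|\right)\|y^{(k+1)}\|_\infty.$$

Thus, the PECE method also has order k, and its principal error function is identical
with that of the corrector formula, as follows immediately from (6.65) and (6.67).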
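The local truncation error $(T_h^{PECE})_n$ can be estimated in terms of the difference
between the (locally) predicted approximation $\mathring y_{n+k}$ and the (locally) corrected
approximation $u_{n+k}$. One has, indeed, by definition of the truncation error,

$$u_{n+k} = -\sum_{s=1}^{k-1}\alpha_s^* y(x_{n+s}) + h\Big\{\beta_k^* f(x_{n+k}, \mathring y_{n+k}) + \sum_{s=1}^{k-1}\beta_s^* y'(x_{n+s})\Big\} = y(x_{n+k}) - h(T_h^{PECE})_n,$$

that is,

$$u_{n+k} - y(x_{n+k}) = -\ell_{k+1}^* h^{k+1}y^{(k+1)}(x_n) + O(h^{k+2}),$$

whereas, from (6.66),

$$\mathring y_{n+k} - y(x_{n+k}) = -\ell_{k+1}h^{k+1}y^{(k+1)}(x_n) + O(h^{k+2}),$$

assuming that $y(x) \in C^{k+2}[a, b]$. Upon subtraction, one gets

$$u_{n+k} - \mathring y_{n+k} = (\ell_{k+1} - \ell_{k+1}^*)\,h^{k+1}y^{(k+1)}(x_n) + O(h^{k+2}),$$

and thus,

$$h^k y^{(k+1)}(x_n) = \frac{1}{\ell_{k+1} - \ell_{k+1}^*}\,\frac{1}{h}\,(u_{n+k} - \mathring y_{n+k}) + O(h^{k+1}).$$

Since, by (6.65) and (6.67),

$$(T_h^{PECE})_n = \ell_{k+1}^* h^k y^{(k+1)}(x_n) + O(h^{k+1}),$$

we obtain

$$(T_h^{PECE})_n = \frac{\ell_{k+1}^*}{\ell_{k+1} - \ell_{k+1}^*}\,\frac{1}{h}\,(u_{n+k} - \mathring y_{n+k}) + O(h^{k+1}). \tag{6.68}$$

The first term on the right of (6.68) is called the Milne estimator of the PECE
truncation error.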

The most popular choice for the predictor is a kth-order Adams–Bashforth
formula, and for the corrector, the corresponding Adams–Moulton formula. Here,

$$\begin{aligned}
\mathring u_{n+k} &= u_{n+k-1} + h\sum_{s=0}^{k-1}\beta_{k,s}f_{n+s},\\
u_{n+k} &= u_{n+k-1} + h\Big\{\beta_{k,k}^* f(x_{n+k}, \mathring u_{n+k}) + \sum_{s=1}^{k-1}\beta_{k,s}^* f_{n+s}\Big\},\\
f_{n+k} &= f(x_{n+k}, u_{n+k}),
\end{aligned} \tag{6.69}$$

with coefficients $\beta_{k,s}$, $\beta_{k,s}^*$ as defined in (6.46) and (6.57), respectively.
A predictor–corrector scheme is meaningful only if the corrector formula is more
accurate than the predictor formula, not necessarily in terms of order but in terms
of error coefficients. Since the principal error functions of the kth-order predictor
and corrector are identical except for a multiplicative constant, $\ell_{k+1}$ in the case of
the predictor, and $\ell_{k+1}^*$ in the case of the corrector, we want $|C_{k,k}^*| < |C_{k,k}|$, which,
assuming Sect. 6.3.5, (6.117), is the same as

$$|\ell_{k+1}^*| < |\ell_{k+1}|. \tag{6.70}$$

For the pair of Adams formulae, we have $\ell_{k+1} = \gamma_k$ and $\ell_{k+1}^* = \gamma_k^*$ (cf. (6.50),
(6.60)), and it is easy to show (see Ex. 4) that

$$|\gamma_k^*| < \frac{1}{k-1}\,\gamma_k, \quad k \ge 2, \tag{6.71}$$

so that indeed the corrector has a smaller error constant than the predictor. More
precisely, it can be shown (see Ex. 5) that

$$\gamma_k \sim \frac{1}{\ln k}, \qquad \gamma_k^* \sim -\frac{1}{k\ln^2 k} \quad \text{as } k \to \infty, \tag{6.72}$$

although both approximations are not very accurate (unless k is extremely large),
the relative errors being of order O(1/ln k).
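
To make the PECE scheme (6.69) and the Milne estimator (6.68) concrete, here is a minimal sketch for k = 2, with the second-order Adams–Bashforth formula as predictor and the second-order Adams–Moulton formula (the trapezoidal rule) as corrector. The function name `pece_adams2`, the scalar test problem $y' = -y$, and the use of the exact value for the second starting approximation are assumptions made for illustration only. With $\ell_3 = \gamma_2 = 5/12$ and $\ell_3^* = \gamma_2^* = -1/12$, the Milne factor is $\ell_3^*/(\ell_3 - \ell_3^*) = -1/6$.

```python
import math

MILNE_FACTOR = (-1.0 / 12.0) / (5.0 / 12.0 + 1.0 / 12.0)   # equals -1/6


def pece_adams2(f, a, b, y0, y1, N):
    """Second-order Adams PECE method on [a, b] with N steps (scalar problem).

    y0, y1 are the starting values u_0, u_1. Returns the grid, the
    approximations, and the Milne estimates of the local truncation error.
    """
    h = (b - a) / N
    x = [a + n * h for n in range(N + 1)]
    u = [y0, y1]
    fv = [f(x[0], y0), f(x[1], y1)]
    milne = []
    for n in range(N - 1):
        u_pred = u[n + 1] + h * (1.5 * fv[n + 1] - 0.5 * fv[n])     # P
        f_pred = f(x[n + 2], u_pred)                                # E
        u_corr = u[n + 1] + h * (0.5 * f_pred + 0.5 * fv[n + 1])    # C
        fv.append(f(x[n + 2], u_corr))                              # E
        u.append(u_corr)
        milne.append(MILNE_FACTOR * (u_corr - u_pred) / h)
    return x, u, milne


def decay(x, y):
    # Assumed test problem y' = -y, y(0) = 1, with exact solution exp(-x).
    return -y


def final_error(N):
    h = 1.0 / N
    _, u, _ = pece_adams2(decay, 0.0, 1.0, 1.0, math.exp(-h), N)
    return abs(u[-1] - math.exp(-1.0))
```

Halving the step size should reduce `final_error` by roughly a factor of 4, in line with order k = 2, and the first Milne estimate (computed from exact starting data) can be compared with the actual local truncation error $(y(x_2) - u_2)/h$.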


6.3 Global Description of Multistep Methods 


We already commented in Sect.6.1.1 on the fact that a linear multistep method 
such as (6.2) represents a system of nonlinear difference equations. To study its 
properties, one inevitably has to deal with the theory of difference equations. Since 
nonlinearities are hidden behind a (small) factor h, it turns out that the theory of
linear difference equations with constant coefficients will suffice to carry through 
the analysis. We therefore begin with recalling the basic facts of this theory. We then 
define stability in a manner similar to that of Chap. 5, Sect. 5.7.1, and identify a root 
condition for the characteristic equation of the difference equation as the true source 
of stability. This, together with consistency, then immediately implies convergence, 
as for one-step methods. 


6.3.1 Linear Difference Equations 


With notations close to those adopted in Sect. 6.1 we consider a (scalar) linear
difference equation of order k,

$$v_{n+k} + \alpha_{k-1}v_{n+k-1} + \cdots + \alpha_0 v_n = \varphi_{n+k}, \quad n = 0, 1, 2, \ldots, \tag{6.73}$$

where $\alpha_s$ are given real numbers not depending on n and not necessarily with
$\alpha_0 \ne 0$, and $\{\varphi_{n+k}\}_{n=0}^{\infty}$ is a given sequence. Any sequence $\{v_n\}_{n=0}^{\infty}$ satisfying
(6.73) is called a solution of the difference equation. It is uniquely determined by
the starting values $v_0, v_1, \ldots, v_{k-1}$. (If $\alpha_0 = \alpha_1 = \cdots = \alpha_{\ell-1} = 0$, $1 \le \ell < k$, then
$\{v_n\}_{n\ge k}$ is not affected by $v_0, v_1, \ldots, v_{\ell-1}$.) Equation (6.73) is called homogeneous
if $\varphi_{n+k} = 0$ for all $n \ge 0$ and inhomogeneous otherwise. It has exact order k if
$\alpha_0 \ne 0$.


6.3.1.1 Homogeneous Equation 


We begin with the homogeneous equation

$$v_{n+k} + \alpha_{k-1}v_{n+k-1} + \cdots + \alpha_0 v_n = 0, \quad n = 0, 1, 2, \ldots. \tag{6.74}$$

We call

$$\alpha(t) = \sum_{s=0}^{k}\alpha_s t^s \quad (\alpha_k = 1) \tag{6.75}$$

the characteristic polynomial of (6.74) and

$$\alpha(t) = 0 \tag{6.76}$$

its characteristic equation. If $t_s$, $s = 1, 2, \ldots, k'$ ($k' \le k$), denote the distinct roots
of (6.76) and $m_s$ their multiplicities, then the general solution of (6.74) is given by

$$v_n = \sum_{s=1}^{k'}\left(\sum_{r=0}^{m_s-1} c_{rs}n^r\right)t_s^n, \quad n = 0, 1, 2, \ldots, \tag{6.77}$$

where $c_{rs}$ are arbitrary (real or complex) constants. There is a one-to-one
correspondence between these k constants and the k starting values $v_0, v_1, \ldots, v_{k-1}$.

We remark that if $\alpha_0 = 0$, then one of the roots $t_s$ is zero, which contributes an
identically vanishing solution (except for n = 0, where by convention $t_s^n = 0^0 = 1$).
If additional coefficients $\alpha_1 = \cdots = \alpha_{\ell-1}$, $\ell < k$, are zero, this further restricts the
solution manifold. Note also that a complex root $t_s = \rho e^{i\theta}$ contributes a complex
solution component in (6.77). However, since the $\alpha_s$ are assumed real, then with $t_s$
also $\bar t_s = \rho e^{-i\theta}$ is a root of (6.76), and we can combine the two complex solutions
$n^r t_s^n$ and $n^r\bar t_s^n$ to form a pair of real solutions,

$$\tfrac12 n^r(t_s^n + \bar t_s^n) = n^r\rho^n\cos n\theta, \qquad \tfrac{1}{2i}n^r(t_s^n - \bar t_s^n) = n^r\rho^n\sin n\theta.$$

If we do this for each complex root $t_s$ and select all coefficients in (6.77) real, we
obtain the general solution in real form.
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The following is a simple but important observation. 


Theorem 6.3.1. We have $|v_n| \le M$, all $n \ge 0$, for every solution $\{v_n\}$ of 
the homogeneous equation (6.74), with M depending only on the starting values 
$v_0, v_1, \ldots, v_{k-1}$ (but not on n) if and only if 

$$\alpha(t_s) = 0 \ \text{ implies } \ \begin{cases} \text{either } |t_s| < 1, \\ \text{or } |t_s| = 1, \ m_s = 1. \end{cases} \tag{6.78}$$

Proof. Sufficiency of (6.78). If (6.78) holds for every root $t_s$ of (6.76), then 
every term $n^r t_s^n$ is bounded for all $n \ge 0$ (going to zero as $n \to \infty$ in the first case of 
(6.78) and being equal to 1 in absolute value in the second case). Since the constants 
$c_{rs}$ in (6.77) are uniquely determined by the starting values $v_0, v_1, \ldots, v_{k-1}$, the 
assertion $|v_n| \le M$ follows. 

Necessity of (6.78). If $|v_n| \le M$, we cannot have $|t_s| > 1$, since we can 
always arrange to have $c_{rs} \ne 0$ for a corresponding term in (6.77) and select all 
other constants to be 0. This singles out an unbounded solution of (6.74). Nor can 
we have, for the same reason, $|t_s| = 1$ and $m_s > 1$. □ 


Condition (6.78) is referred to as the root condition for the difference equation 
(6.74) (and also for (6.73)). 
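
The effect of the root condition can be observed numerically. The following is a minimal sketch (the helper name `solve_homogeneous` and the two test polynomials are illustrative choices, not taken from the text): it iterates (6.74) once for $\alpha(t) = t^2 - 1$, whose roots $\pm 1$ are simple, and once for $\alpha(t) = (t-1)^2$, which has a double root on the unit circle.

```python
def solve_homogeneous(alpha, start, n_max):
    """Iterate v_{n+k} = -(alpha_{k-1} v_{n+k-1} + ... + alpha_0 v_n), alpha_k = 1.

    alpha = [alpha_0, ..., alpha_{k-1}], start = [v_0, ..., v_{k-1}].
    Returns the list [v_0, ..., v_{n_max}].
    """
    k = len(alpha)
    v = list(start)
    for n in range(n_max - k + 1):
        v.append(-sum(alpha[s] * v[n + s] for s in range(k)))
    return v

# alpha(t) = t^2 - 1: simple roots +1 and -1, root condition satisfied.
bounded = solve_homogeneous([-1.0, 0.0], [1.0, 2.0], 200)
# alpha(t) = t^2 - 2t + 1: double root at t = 1, root condition violated.
growing = solve_homogeneous([1.0, -2.0], [1.0, 2.0], 200)

print(max(abs(x) for x in bounded))   # stays at the size of the starting values
print(max(abs(x) for x in growing))   # grows linearly with n
```

In the second case the solution is $v_n = n + 1$, the term $n\,t_s^n$ with $t_s = 1$ that Theorem 6.3.1 excludes.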

Representation (6.77) of the general solution is inconvenient insofar as it does 
not explicitly exhibit its dependence on the starting values. A representation that 
does can be obtained by defining k special solutions $\{h_{n,s}\}$, $s = 0, 1, \ldots, k-1$, of 
(6.74), having as starting values those of the unit matrix, that is, 

$$h_{n,s} = \delta_{n,s} \quad \text{for } n = 0, 1, \ldots, k-1, \tag{6.79}$$

with $\delta_{n,s}$ the Kronecker delta. Then indeed, the general solution of (6.74) is 

$$v_n = \sum_{s=0}^{k-1} v_s h_{n,s}. \tag{6.80}$$

(Note that if $\alpha_0 = 0$, then $h_{0,0} = 1$ and $h_{n,0} = 0$ for all $n \ge 1$; similarly, if 
$\alpha_0 = \alpha_1 = 0$, and so on.) 


6.3.1.2 Inhomogeneous Equation 


To deal with the general inhomogeneous equation (6.73), we define for each m = k, 
k+1, k+2, ... the solution $\{g_{n,m}\}_{n=0}^{\infty}$ of the "initial value problem" 

$$\sum_{s=0}^{k} \alpha_s g_{n+s,m} = \delta_{n,m-k}, \quad n = 0, 1, 2, \ldots, \qquad g_{0,m} = g_{1,m} = \cdots = g_{k-1,m} = 0. \tag{6.81}$$

Here the difference equation is a very special case of (6.73), namely, with 

$$\varphi_{n+k} = \delta_{n,m-k} = \begin{cases} 0 & \text{if } n \ne m-k, \\ 1 & \text{if } n = m-k, \end{cases}$$

an "impulse function." We can then superimpose these special solutions to form the 
solution 

$$v_n = \sum_{m=k}^{n} g_{n,m} \varphi_m \tag{6.82}$$

of the initial value problem $v_0 = v_1 = \cdots = v_{k-1} = 0$ for (6.73) (Duhamel's 
Principle). This is easily verified by observing first that $g_{n,m} = 0$ for n < m, so that 
(6.82) can be written in the form 

$$v_n = \sum_{m=k}^{\infty} g_{n,m} \varphi_m. \tag{6.83}$$

We then have 

$$\sum_{s=0}^{k} \alpha_s v_{n+s} = \sum_{s=0}^{k} \alpha_s \sum_{m=k}^{\infty} g_{n+s,m} \varphi_m = \sum_{m=k}^{\infty} \varphi_m \sum_{s=0}^{k} \alpha_s g_{n+s,m} = \varphi_{n+k},$$

where the last equation follows from (6.81). 

Since effectively (6.81) is a "delayed" initial value problem for a homogeneous 
difference equation with k starting values 0, 0, ..., 0, 1, we can express $g_{n,m}$ in 
(6.83) alternatively as $h_{n-m+k-1,k-1}$ (cf. (6.79)). The general solution of the 
inhomogeneous equation (6.73) is the general solution of the homogeneous equation 
plus the special solution (6.82) of the inhomogeneous equation; thus, in view 
of (6.80), 

$$v_n = \sum_{s=0}^{k-1} v_s h_{n,s} + \sum_{m=k}^{n} h_{n-m+k-1,k-1} \varphi_m. \tag{6.84}$$
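
The representation (6.84) can be checked against direct recursion. The sketch below is only an illustration under assumed data: the polynomial $\alpha(t) = (t-1)(t-\tfrac12)$, the starting values, and the forcing sequence $\varphi_m = 1/(m+1)$ are arbitrary choices, and `recurse` is a hypothetical helper name.

```python
def recurse(alpha, start, phi, n_max):
    """Solve v_{n+k} + sum_{s<k} alpha_s v_{n+s} = phi[n+k] by direct recursion."""
    k = len(alpha)
    v = list(start)
    for n in range(n_max - k + 1):
        v.append(phi[n + k] - sum(alpha[s] * v[n + s] for s in range(k)))
    return v

alpha = [0.5, -1.5]            # alpha(t) = t^2 - 1.5 t + 0.5 = (t - 1)(t - 0.5)
k, n_max = 2, 30
start = [1.0, -2.0]
# phi[m] for m >= k; the first k entries are never used
phi = [0.0] * k + [1.0 / (m + 1) for m in range(k, n_max + 1)]
zero = [0.0] * (n_max + 1)

# h[s][n] = h_{n,s}: homogeneous solutions with unit starting values, cf. (6.79)
h = [recurse(alpha, [1.0 if n == s else 0.0 for n in range(k)], zero, n_max)
     for s in range(k)]

direct = recurse(alpha, start, phi, n_max)
# right-hand side of (6.84)
duhamel = [sum(start[s] * h[s][n] for s in range(k))
           + sum(h[k - 1][n - m + k - 1] * phi[m] for m in range(k, n + 1))
           for n in range(n_max + 1)]

print(max(abs(a - b) for a, b in zip(direct, duhamel)))  # rounding level
```

Both sequences agree up to rounding, which confirms that $g_{n,m} = h_{n-m+k-1,k-1}$ acts as the discrete impulse response.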


Theorem 6.3.2. There exists a constant M > 0, independent of n, such that 

$$|v_n| \le M \Big\{ \max_{0 \le s \le k-1} |v_s| + \sum_{m=k}^{n} |\varphi_m| \Big\}, \quad n = 0, 1, 2, \ldots, \tag{6.85}$$

for every solution $\{v_n\}$ of (6.73) and for every $\{\varphi_{n+k}\}$, if and only if the 
characteristic polynomial $\alpha(t)$ of (6.73) satisfies the root condition (6.78). 

Proof. Sufficiency of (6.78). By Theorem 6.3.1 the inequality (6.85) follows 
immediately from (6.84), with M the constant of Theorem 6.3.1. 

Necessity of (6.78). Take $\varphi_m = 0$, all $m \ge k$, in which case $\{v_n\}$ is a 
bounded solution of the homogeneous equation, and hence, by Theorem 6.3.1, the 
root condition must hold. □ 

As an application of Theorems 6.3.1 and 6.3.2, we take $v_n = h_{n,s}$ (cf. (6.79)) and 
$v_n = g_{n,m}$ (cf. (6.81)). The former is bounded by M, by Theorem 6.3.1, and so is 
the latter, by (6.85), since $g_{s,m} = 0$ for $0 \le s \le k-1$ and all $\varphi_m$ are 0 except one, 
which is 1. Thus, if the root condition is satisfied, then 

$$|h_{n,s}| \le M, \quad |g_{n,m}| \le M, \quad \text{all } n \ge 0. \tag{6.86}$$

Note that, since $h_{s,s} = 1$, we must have $M \ge 1$. 


6.3.2 Stability and Root Condition 


We now return to the general multistep method (6.2) for solving the initial value 
problem (6.1). In terms of the residual operator $R_h$ of (6.14) we define stability, 
similarly as for one-step methods (cf. Chap. 5, Sect. 5.7.1), as follows. 

Definition 6.3.1. Method (6.2) is called stable on [a,b] if there exists a constant 
K > 0 not depending on h such that for an arbitrary (uniform) grid h on [a,b] and 
for arbitrary two grid functions $v, w \in \Gamma_h[a,b]$, there holds 

$$\|v - w\|_\infty \le K \Big( \max_{0 \le s \le k-1} \|v_s - w_s\| + \|R_h v - R_h w\|_\infty \Big), \quad v, w \in \Gamma_h[a,b], \tag{6.87}$$

for all h sufficiently small. 

The motivation for this "stability inequality" is much the same as for one-step 
methods (Chap. 5, (5.83)-(5.85)), and we do not repeat it here. 
Let $\mathcal{F}$ be the family of functions f satisfying a uniform Lipschitz condition 

$$\|f(x,y) - f(x,y^*)\| \le L \|y - y^*\|, \quad x \in [a,b], \ y, y^* \in \mathbb{R}^d, \tag{6.88}$$

with Lipschitz constant $L = L_f$ depending on f. 


Theorem 6.3.3. The multistep method (6.2) is stable for every $f \in \mathcal{F}$ if and only 
if its characteristic polynomial (6.75) satisfies the root condition (6.78). 

Proof. Necessity of (6.78). Consider $f(x,y) \equiv 0$, which is certainly in $\mathcal{F}$, 
and for which 

$$(R_h v)_n = \frac{1}{h} \sum_{s=0}^{k} \alpha_s v_{n+s}.$$

Take v = u and w = 0 in (6.87), where u is a grid function satisfying 

$$\sum_{s=0}^{k} \alpha_s u_{n+s} = 0, \quad n = 0, 1, 2, \ldots. \tag{6.89}$$

Since (6.87) is to hold for arbitrarily fine grids, the integer n in (6.89) can assume 
arbitrarily large values, and it follows from (6.87) and $R_h u = 0$ that u is uniformly 
bounded, 

$$\|u\|_\infty \le K \max_{0 \le s \le k-1} \|u_s\|, \tag{6.90}$$

the bound depending only on the starting values. Since u is a solution of the 
homogeneous difference equation (6.74), its characteristic polynomial must satisfy 
the root condition by Theorem 6.3.1. 
Sufficiency of (6.78). Let $f \in \mathcal{F}$ and $v, w \in \Gamma_h[a,b]$ be arbitrary grid 
functions. By definition of $R_h$ we have 

$$\sum_{s=0}^{k} \alpha_s v_{n+s} = h \sum_{s=0}^{k} \beta_s f(x_{n+s}, v_{n+s}) + h (R_h v)_n, \quad n = 0, 1, 2, \ldots, N-k,$$

and similarly for w. Subtraction then gives 

$$\sum_{s=0}^{k} \alpha_s (v_{n+s} - w_{n+s}) = \varphi_{n+k}, \quad n = 0, 1, 2, \ldots, N-k,$$

where 

$$\varphi_{n+k} = h \sum_{s=0}^{k} \beta_s [f(x_{n+s}, v_{n+s}) - f(x_{n+s}, w_{n+s})] + h [(R_h v)_n - (R_h w)_n]. \tag{6.91}$$

Therefore, v - w is formally a solution of the inhomogeneous difference equation 
(6.73) (the forcing function $\varphi_{n+k}$, though, depending also on v and w), so that by 
(6.80) and (6.83) we can write 

$$v_n - w_n = \sum_{s=0}^{k-1} h_{n,s} (v_s - w_s) + \sum_{m=k}^{n} g_{n,m} \varphi_m.$$
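Since the root condition is satisfied, we have by (6.86) 

$$|h_{n,s}| \le M, \quad |g_{n,m}| \le M$$

for some constant $M \ge 1$, uniformly in n and m. Therefore, 

$$\|v_n - w_n\| \le M \Big\{ k \max_{0 \le s \le k-1} \|v_s - w_s\| + \sum_{m=k}^{n} \|\varphi_m\| \Big\}, \quad n = 0, 1, 2, \ldots, N. \tag{6.92}$$

By (6.91) we can estimate 

$$\|\varphi_m\| = \Big\| h \sum_{s=0}^{k} \beta_s [f(x_{m-k+s}, v_{m-k+s}) - f(x_{m-k+s}, w_{m-k+s})] + h [(R_h v)_{m-k} - (R_h w)_{m-k}] \Big\|$$
$$\le h \beta L \sum_{s=0}^{k} \|v_{m-k+s} - w_{m-k+s}\| + h \|R_h v - R_h w\|_\infty, \tag{6.93}$$

where 

$$\beta = \max_{0 \le s \le k} |\beta_s|.$$

Letting e = v - w and $r_h = R_h v - R_h w$, we obtain from (6.92) and (6.93) 

$$\|e_n\| \le M \Big\{ k \max_{0 \le s \le k-1} \|e_s\| + h \beta L \sum_{m=k}^{n} \sum_{s=0}^{k} \|e_{m-k+s}\| + N h \|r_h\|_\infty \Big\}.$$

Noting that 

$$\sum_{m=k}^{n} \sum_{s=0}^{k} \|e_{m-k+s}\| = \sum_{s=0}^{k} \sum_{m=k}^{n} \|e_{m-k+s}\| \le \sum_{s=0}^{k} \sum_{m=0}^{n} \|e_m\| = (k+1) \sum_{m=0}^{n} \|e_m\|$$

and using Nh = b - a, we get 

$$\|e_n\| \le M \Big\{ k \max_{0 \le s \le k-1} \|e_s\| + h (k+1) \beta L \sum_{m=0}^{n} \|e_m\| + (b-a) \|r_h\|_\infty \Big\}. \tag{6.94}$$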


Now let h be so small that 

$$1 - h(k+1)\beta L M \ge \tfrac{1}{2}.$$

Then, splitting off the term with $\|e_n\|$ on the right of (6.94) and moving it to the left, 
we obtain 

$$(1 - h(k+1)\beta L M) \|e_n\| \le M \Big\{ k \max_{0 \le s \le k-1} \|e_s\| + h (k+1) \beta L \sum_{m=0}^{n-1} \|e_m\| + (b-a) \|r_h\|_\infty \Big\},$$

or 

$$\|e_n\| \le 2M \Big\{ h (k+1) \beta L \sum_{m=0}^{n-1} \|e_m\| + k \max_{0 \le s \le k-1} \|e_s\| + (b-a) \|r_h\|_\infty \Big\}.$$

Thus, 

$$\|e_n\| \le h A \sum_{m=0}^{n-1} \|e_m\| + B, \tag{6.95}$$

where 

$$A = 2(k+1)\beta L M, \quad B = 2M \Big( k \max_{0 \le s \le k-1} \|e_s\| + (b-a) \|r_h\|_\infty \Big). \tag{6.96}$$

Consider, along with (6.95), the difference equation 

$$E_n = h A \sum_{m=0}^{n-1} E_m + B, \quad E_0 = B. \tag{6.97}$$

It is easily seen by induction that 

$$E_n = B (1 + hA)^n, \quad n = 0, 1, 2, \ldots. \tag{6.98}$$

Subtracting (6.97) from (6.95), we get 

$$\|e_n\| - E_n \le h A \sum_{m=0}^{n-1} (\|e_m\| - E_m). \tag{6.99}$$

Clearly, $\|e_0\| \le B = E_0$. Thus, by (6.99), $\|e_1\| - E_1 \le 0$, and by induction on n, 

$$\|e_n\| \le E_n, \quad n = 0, 1, 2, \ldots.$$

Thus, by (6.98), 

$$\|e_n\| \le B (1 + hA)^n \le B e^{nhA} \le B e^{(b-a)A}.$$

Recalling the definition of B in (6.96), we find 

$$\|e_n\| \le 2M e^{(b-a)A} \Big\{ k \max_{0 \le s \le k-1} \|e_s\| + (b-a) \|R_h v - R_h w\|_\infty \Big\},$$

which is the stability inequality (6.87) with 

$$K = 2M e^{(b-a)A} \max\{k, b-a\}. \qquad \Box$$


Theorem 6.3.3 in particular shows that all Adams methods are stable, since 
for them 

$$\alpha(t) = t^k - t^{k-1} = t^{k-1}(t - 1),$$

and the root condition is trivially satisfied. 

Theorem 6.3.3 also holds for predictor-corrector methods of the type considered 
in (6.64), if one defines the residual operator $R_h^{pc}$ in the definition of stability in 
the obvious way: 

$$(R_h^{pc} v)_n = \frac{1}{h} \sum_{s=1}^{k} \alpha_s^* v_{n+s} - \Big\{ \beta_k^* f(x_{n+k}, \mathring{v}_{n+k}) + \sum_{s=1}^{k-1} \beta_s^* f(x_{n+s}, v_{n+s}) \Big\}, \tag{6.100}$$

with 

$$\mathring{v}_{n+k} = - \sum_{s=0}^{k-1} \alpha_s v_{n+s} + h \sum_{s=0}^{k-1} \beta_s f(x_{n+s}, v_{n+s}), \tag{6.101}$$

and considers the characteristic polynomial to be that of the corrector formula 
(see Ex. 7). 

The problem of constructing stable multistep methods of maximum order is 
considered later in Sect. 6.4. 


6.3.3 Convergence 


With the powerful property of stability at hand, the convergence of multistep 
methods follows almost immediately as a corollary. We first define what we mean 
by convergence. 


Definition 6.3.2. Consider a uniform grid on [a,b] with grid length h. Let $u = \{u_n\}$ 
be the grid function obtained by applying the multistep method (6.2) on [a,b], with 
starting approximations $u_s(h)$ as in (6.4). Let $y = \{y_n\}$ be the grid function induced 
by the exact solution of the initial value problem on [a,b]. Method (6.2) is said to 
converge on [a,b] if there holds 

$$\|u - y\|_\infty \to 0 \quad \text{as } h \to 0 \tag{6.102}$$

whenever 

$$u_s(h) \to y_0 \quad \text{as } h \to 0, \quad s = 0, 1, \ldots, k-1. \tag{6.103}$$

Theorem 6.3.4. The multistep method (6.2) converges for all $f \in \mathcal{F}$ (cf. (6.88)) 
if and only if it is consistent and stable. If, in addition, (6.2) has order p and 
$u_s(h) - y(x_s) = O(h^p)$, $s = 0, 1, \ldots, k-1$, then 

$$\|u - y\|_\infty = O(h^p) \quad \text{as } h \to 0. \tag{6.104}$$

Proof. Necessity. Let $f \equiv 0$ (which is certainly in $\mathcal{F}$) and $y_0 = 0$. Then 
$y(x) \equiv 0$ and (6.2) reduces to (6.89). Since the same relations hold for each 
component of y and u, we may as well consider a scalar problem. (This holds for 
the rest of the necessity part of the proof.) Assume first, by way of contradiction, 
that (6.2) is not stable. Then, by Theorem 6.3.3, there is a root $t_s$ of the characteristic 
equation $\alpha(t) = 0$ for which either $|t_s| > 1$, or $|t_s| = 1$ and $m_s > 1$. In the first case, 
(6.89) has a solution $u_n = h t_s^n$ for which $|u_n| = h |t_s|^n$. Clearly, the starting values 
$u_0, u_1, \ldots, u_{k-1}$ all tend to $y_0 = 0$ as $h \to 0$, but $|u_n| \to \infty$ as $n \to \infty$. Since for 
h sufficiently small we can have n arbitrarily large, this contradicts convergence of 
$\{u_n\}$ to the solution $\{y_n\}$, $y_n \equiv 0$. The same argument applies in the second case if 
we consider (say) $u_n = h n^{m_s - 1} t_s^n$, where now $|t_s| = 1$, $m_s > 1$. This proves the 
necessity of stability. 

To prove consistency, we must show that $\alpha(1) = 0$ and $\alpha'(1) = \beta(1)$, where 
(anticipating (6.122)) we define $\beta(t) = \sum_{s=0}^{k} \beta_s t^s$. For the former, we consider 
$f \equiv 0$, $y_0 = 1$, which has the exact solution $y(x) \equiv 1$ and the numerical 
solution still satisfying (6.89). If we take $u_s = 1$, $s = 0, 1, \ldots, k-1$, the assumed 
convergence implies that $u_{n+s} \to 1$ as $h \to 0$, hence $0 = \sum_{s=0}^{k} \alpha_s u_{n+s} \to \alpha(1)$. 
For the latter, we consider $f \equiv 1$ and $y_0 = 0$, that is, $y(x) = x - a$. The multistep 
method now generates a grid function $u = \{u_n\}$ satisfying 

$$\sum_{s=0}^{k} \alpha_s u_{n+s} - \beta(1) h = 0.$$

A particular solution is given by $u_n = \frac{\beta(1)}{\alpha'(1)} n h$. Indeed, since $\alpha(1) = 0$, as already 
shown, we have 

$$\sum_{s=0}^{k} \alpha_s u_{n+s} - \beta(1) h = \frac{\beta(1)}{\alpha'(1)} h \sum_{s=0}^{k} \alpha_s (n+s) - \beta(1) h$$
$$= \frac{\beta(1)}{\alpha'(1)} h [n \alpha(1) + \alpha'(1)] - \beta(1) h = \frac{\beta(1)}{\alpha'(1)} h \alpha'(1) - \beta(1) h = 0.$$

Since also $u_s \to y_0 = 0$ for $h \to 0$, $s = 0, 1, \ldots, k-1$, and since by assumption 
$\{u_n\}$ converges to $\{y_n\}$, $y_n = nh$, we must have $\frac{\beta(1)}{\alpha'(1)} = 1$, that is, $\alpha'(1) = \beta(1)$. 

Sufficiency. Letting v = u and w = y in the stability inequality (6.87), and 
noting that $R_h u = 0$ and $R_h y = T_h$, the grid function $\{(T_h)_n\}$ of the truncation 
errors (cf. (6.15)), we get 

$$\|u - y\|_\infty \le K \Big\{ \max_{0 \le s \le k-1} \|u_s - y(x_s)\| + \|T_h\|_\infty \Big\}. \tag{6.105}$$

If method (6.2) is consistent, the second term in braces tends to zero and, therefore, 
also the term on the left, if (6.103) holds, that is, if $u_s - y(x_s) \to 0$ as $h \to 0$. This 
completes the proof of the first part of the theorem. The second part follows likewise, 
since by assumption both terms between braces in (6.105) are of $O(h^p)$. □ 

In view of the remark near the end of Sect. 6.3.2, Theorem 6.3.4 holds also for 
the kth-order predictor-corrector method (6.64) with p = k. The proof is the same, 
since for the truncation error, $\|T_h^{pc}\|_\infty = O(h^k)$, as was shown in Sect. 6.2.3. 
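
The role of stability in Theorem 6.3.4 can be illustrated by a small experiment. The sketch below is an illustration only: the test problem $y' = -y$, $y(0) = 1$ on [0,1] and the comparison method $u_{n+2} + 4u_{n+1} - 5u_n = h(4f_{n+1} + 2f_n)$ are choices made here, not taken from the text. That method is consistent, but its characteristic polynomial $t^2 + 4t - 5 = (t-1)(t+5)$ violates the root condition. It is compared with the second-order Adams-Bashforth method, whose characteristic polynomial is $t^2 - t$.

```python
import math

def two_step(alpha0, alpha1, beta0, beta1, h, n_steps):
    """Explicit two-step method
        u_{n+2} + alpha1 u_{n+1} + alpha0 u_n = h (beta1 f_{n+1} + beta0 f_n)
    applied to y' = -y, y(0) = 1, with exact starting value u_1 = exp(-h)."""
    f = lambda y: -y
    u = [1.0, math.exp(-h)]
    for n in range(n_steps - 1):
        u.append(-alpha1 * u[n + 1] - alpha0 * u[n]
                 + h * (beta1 * f(u[n + 1]) + beta0 * f(u[n])))
    return u

exact = math.exp(-1.0)
err_ab2, err_bad = {}, {}
for N in (10, 20, 40):
    h = 1.0 / N
    # Adams-Bashforth: roots 0 and 1, stable
    err_ab2[N] = abs(two_step(0.0, -1.0, -0.5, 1.5, h, N)[N] - exact)
    # roots 1 and -5, root condition violated
    err_bad[N] = abs(two_step(-5.0, 4.0, 2.0, 4.0, h, N)[N] - exact)
    print(N, err_ab2[N], err_bad[N])
```

Refining the grid reduces the error of the stable method, whereas for the unstable one the parasitic root near -5 is amplified over more steps and the error grows as h decreases.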


6.3.4 Asymptotics of Global Error 


A refinement of Theorem 6.3.4, (6.104), exhibiting the leading term in the global 
discretization error, is given by the following theorem. 


Theorem 6.3.5. Assume that 

(1) $f(x,y) \in C^2$ on $[a,b] \times \mathbb{R}^d$; 

(2) the multistep method (6.2) is stable (i.e., satisfies the root condition) and has 
order $p \ge 1$; 

(3) the exact solution y(x) of (6.1) is of class $C^{p+2}[a,b]$; 

(4) the starting approximations (6.4) satisfy 

$$u_s - y(x_s) = O(h^{p+1}) \quad \text{as } h \to 0, \quad s = 0, 1, \ldots, k-1;$$

(5) e(x) is the solution of the linear initial value problem 

$$\frac{de}{dx} = f_y(x, y(x)) e - y^{(p+1)}(x), \quad e(a) = 0. \tag{6.106}$$

Then, for n = 0, 1, 2, ..., N, 

$$u_n - y(x_n) = C_{k,p} h^p e(x_n) + O(h^{p+1}) \quad \text{as } h \to 0, \tag{6.107}$$

where $C_{k,p}$ is the error constant of (6.2), that is, 

$$C_{k,p} = \frac{\ell_{p+1}}{\sum_{s=0}^{k} \beta_s}, \quad \ell_{p+1} = L \frac{t^{p+1}}{(p+1)!}. \tag{6.108}$$

Before proving the theorem, we make the following remarks: 

1. Under the assumptions made in (1) and (3), the solution e of (6.106) exists 
uniquely on [a,b]. It is the same for all multistep methods of order p. 

2. The error constant $C_{k,p}$ depends only (through the coefficients $\alpha_s$, $\beta_s$) on the 
multistep method and not on the initial value problem to be solved. 

3. For a given differential system (6.1), the asymptotically best k-step method of 
order p would be one for which $|C_{k,p}|$ is smallest. Unfortunately, as we see later 
in Sect. 6.4.3, the minimum of $|C_{k,p}|$ over all stable k-step methods of order p 
cannot be generally attained. 

4. Stability and $p \ge 1$ implies $\sum_{s=0}^{k} \beta_s \ne 0$. In fact, $\alpha(1) = \sum_{s=0}^{k} \alpha_s = 0$, since 
$L1 = 0$, and $Lt = \sum_{s=0}^{k} s \alpha_s - \sum_{s=0}^{k} \beta_s = 0$ since $p \ge 1$. Consequently, 

$$\sum_{s=0}^{k} \beta_s = \sum_{s=0}^{k} s \alpha_s = \alpha'(1) \ne 0 \tag{6.109}$$

by the root condition. Actually, $\sum_{s=0}^{k} \beta_s > 0$, as we show later in Sect. 6.4.3. 
Proof of Theorem 6.3.5. Define the grid function $r = \{r_n\}$ by 

$$r = h^{-p} (u - y). \tag{6.110}$$

Then 

$$\frac{1}{h} \sum_{s=0}^{k} \alpha_s r_{n+s} = h^{-p} \Big[ \frac{1}{h} \sum_{s=0}^{k} \alpha_s u_{n+s} - \frac{1}{h} \sum_{s=0}^{k} \alpha_s y(x_{n+s}) \Big]$$
$$= h^{-p} \Big[ \sum_{s=0}^{k} \beta_s f(x_{n+s}, u_{n+s}) - \Big\{ \sum_{s=0}^{k} \beta_s y'(x_{n+s}) + (T_h)_n \Big\} \Big],$$

where $T_h$ is the truncation error defined in (6.19). Expanding f about the exact 
solution trajectory and noting the form of the principal error function given in 
Theorem 6.1.3, we obtain 

$$\frac{1}{h} \sum_{s=0}^{k} \alpha_s r_{n+s} = h^{-p} \Big[ \sum_{s=0}^{k} \beta_s \big\{ f(x_{n+s}, y(x_{n+s})) + f_y(x_{n+s}, y(x_{n+s})) (u_{n+s} - y(x_{n+s})) + O(h^{2p}) \big\}$$
$$\qquad - \sum_{s=0}^{k} \beta_s y'(x_{n+s}) - \ell_{p+1} y^{(p+1)}(x_n) h^p + O(h^{p+1}) \Big],$$

having used Assumption (1) and the fact that $u_{n+s} - y(x_{n+s}) = O(h^p)$ by 
Theorem 6.3.4. Now the sums over f and y' cancel, since y is the solution of 
the differential system (6.1). Furthermore, $O(h^{2p})$ is of $O(h^{p+1})$ since $p \ge 1$, and 
making use of the definition (6.110) of r, we can simplify the preceding to 

$$\frac{1}{h} \sum_{s=0}^{k} \alpha_s r_{n+s} = h^{-p} \Big[ \sum_{s=0}^{k} \beta_s f_y(x_{n+s}, y(x_{n+s})) h^p r_{n+s} - \ell_{p+1} y^{(p+1)}(x_n) h^p + O(h^{p+1}) \Big]$$
$$= \sum_{s=0}^{k} \beta_s f_y(x_{n+s}, y(x_{n+s})) r_{n+s} - \ell_{p+1} y^{(p+1)}(x_n) + O(h).$$


Now 

$$\sum_{s=0}^{k} \beta_s y^{(p+1)}(x_{n+s}) = \sum_{s=0}^{k} \beta_s [y^{(p+1)}(x_n) + O(h)] = \Big( \sum_{s=0}^{k} \beta_s \Big) y^{(p+1)}(x_n) + O(h),$$

so that 

$$\ell_{p+1} y^{(p+1)}(x_n) = \frac{\ell_{p+1}}{\sum_{s=0}^{k} \beta_s} \sum_{s=0}^{k} \beta_s y^{(p+1)}(x_{n+s}) + O(h) = C_{k,p} \sum_{s=0}^{k} \beta_s y^{(p+1)}(x_{n+s}) + O(h),$$

by the definition (6.108) of the error constant $C_{k,p}$. Thus, 

$$\frac{1}{h} \sum_{s=0}^{k} \alpha_s r_{n+s} - \sum_{s=0}^{k} \beta_s \big[ f_y(x_{n+s}, y(x_{n+s})) r_{n+s} - C_{k,p} y^{(p+1)}(x_{n+s}) \big] = O(h).$$

Defining 

$$\mathring{r} = \frac{1}{C_{k,p}} r, \tag{6.111}$$

we finally get 

$$\frac{1}{h} \sum_{s=0}^{k} \alpha_s \mathring{r}_{n+s} - \sum_{s=0}^{k} \beta_s \big[ f_y(x_{n+s}, y(x_{n+s})) \mathring{r}_{n+s} - y^{(p+1)}(x_{n+s}) \big] = O(h). \tag{6.112}$$

The left-hand side can now be viewed as the residual operator $R_h^g$ of the multistep 
method (6.2) applied to the grid function $\mathring{r}$, not for the original differential system 
(6.1), however, but for the linear system (6.106) with right-hand side 

$$g(x, e) := f_y(x, y(x)) e - y^{(p+1)}(x), \quad a \le x \le b, \ e \in \mathbb{R}^d. \tag{6.113}$$

That is, 

$$\|R_h^g \mathring{r}\|_\infty = O(h). \tag{6.114}$$

For the exact solution e(x) of (6.106), we have likewise 

$$\|R_h^g e\|_\infty = O(h), \tag{6.115}$$

since by (6.15) (applied to the linear system e' = g(x,e)) $R_h^g e$ is the truncation error 
of the multistep method (6.2), and its order is $p \ge 1$. Since by Assumption (2) the 
multistep method (6.2) is stable, we can apply the stability inequality (6.87) (for the 
system e' = g(x,e)) to the two grid functions $v = \mathring{r}$ and w = e, giving 

$$\|\mathring{r} - e\|_\infty \le K \Big( \max_{0 \le s \le k-1} \|\mathring{r}_s - e(x_s)\| + \|R_h^g \mathring{r} - R_h^g e\|_\infty \Big) = K \Big( \max_{0 \le s \le k-1} \|\mathring{r}_s - e(x_s)\| + O(h) \Big). \tag{6.116}$$

It remains to observe that, for $0 \le s \le k-1$, 

$$\mathring{r}_s - e(x_s) = \frac{1}{C_{k,p}} r_s - e(x_s) = \frac{1}{C_{k,p}} h^{-p} [u_s - y(x_s)] - e(x_s),$$

and hence, by Assumption (4) and e(a) = 0, 

$$\mathring{r}_s - e(x_s) = O(h) - [e(a) + s h e'(\xi_s)] = O(h),$$

to conclude 

$$\max_{0 \le s \le k-1} \|\mathring{r}_s - e(x_s)\| = O(h)$$

and, therefore, by (6.116), (6.111), and (6.110), 

$$\|u - y - C_{k,p} h^p e\|_\infty = O(h^{p+1}),$$

as was to be shown. □ 
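
Formula (6.107) can be checked on a simple case. The following sketch assumes the test problem $y' = -y$, $y(0) = 1$ on [0,1] and the second-order Adams-Bashforth method (k = p = 2, error constant 5/12); both are choices made here for illustration. For this problem (6.106) reads $e' = -e + e^{-x}$, $e(0) = 0$, with solution $e(x) = x e^{-x}$, so the scaled error $(u_N - y(1))/h^2$ should approach $\tfrac{5}{12} e^{-1}$.

```python
import math

def ab2_error(N):
    """Error u_N - y(1) of Adams-Bashforth 2 on y' = -y, y(0) = 1,
    with exact starting values u_0 = 1, u_1 = exp(-h)."""
    h = 1.0 / N
    u_prev, u = 1.0, math.exp(-h)
    for _ in range(N - 1):
        # u_{n+2} = u_{n+1} + (h/2) (3 f_{n+1} - f_n), with f = -u
        u_prev, u = u, u + 0.5 * h * (3.0 * (-u) - (-u_prev))
    return u - math.exp(-1.0)

C22 = 5.0 / 12.0                 # error constant of AB2
e_at_1 = 1.0 * math.exp(-1.0)    # e(x) = x exp(-x) evaluated at x = 1
for N in (50, 100, 200):
    h = 1.0 / N
    print(N, ab2_error(N) / h**2, C22 * e_at_1)
```

The second column settles toward the third as N grows, with a discrepancy of order h, in line with the $O(h^{p+1})$ remainder in (6.107).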


The proof of Theorem 6.3.5 applies to the predictor-corrector method (6.64) 
with $C_{k,p}$ replaced by $C_{k,k}^* = \ell_{k+1}^* / \sum_{s=1}^{k} \beta_s^*$, the error constant for the corrector 
formula, once one has shown that $\mathring{u}_{n+k} - u_{n+k} = O(h^{k+1})$ in (6.64), and 
$\mathring{y}_{n+k} - y(x_{n+k}) = O(h^{k+1})$. The first relation is true for special predictor-corrector 
schemes (cf. (6.17), (6.120)), whereas the second says that h times the truncation 
error of the predictor formula has the order shown (cf. Sect. 6.2.3). 


6.3.5 Estimation of Global Error 


It is natural to try, similarly as with one-step methods (cf. Chap. 5, Sect. 5.8.1), to 
estimate the leading term in the asymptotic formula (6.107) of the global error by 
integrating with Euler's method the "variational equation" (6.106) along with the 
multistep integration of the principal equation (6.1). The main technical difficulty is 
to correctly estimate the driving function $y^{(p+1)}$ in the linear system (6.106). It turns 
out, however, that Milne's procedure (cf. (6.68)) for estimating the local truncation 
error in predictor-corrector schemes can be extended to estimate the global error 
as well, provided the predictor and corrector formulae have the same characteristic 
polynomial, more precisely, if in the predictor-corrector scheme (6.64) there holds 

$$\alpha_s^* = \alpha_s \ \text{ for } s = 1, 2, \ldots, k; \quad \alpha_0 = 0. \tag{6.117}$$

This is true, in particular, for the Adams predictor-corrector scheme (6.69). We 
formulate the procedure in the following theorem. 


Theorem 6.3.6. Assume that 

(1) $f(x,y) \in C^2$ on $[a,b] \times \mathbb{R}^d$; 

(2) the predictor-corrector scheme (6.64) is based on a pair of kth-order formulae 
($k \ge 1$) satisfying (6.117) and having local error constants $\ell_{k+1}$, $\ell_{k+1}^*$ for the 
predictor and corrector, respectively; 

(3) the exact solution y(x) of (6.1) is of class $C^{k+2}[a,b]$; 

(4) the starting approximations (6.4) satisfy 

$$u_s - y(x_s) = O(h^{k+1}) \quad \text{as } h \to 0, \quad s = 0, 1, \ldots, k-1;$$

(5) along with the grid function $u = \{u_n\}$ constructed by the predictor-corrector 
scheme, we generate the grid function $v = \{v_n\}$ in the following manner (where 
$\mathring{u}_{n+k}$ is defined as in (6.64)): 

$$v_s = 0, \quad s = 0, 1, \ldots, k-1;$$
$$v_{n+k} = v_{n+k-1} + h \Big\{ f_y(x_n, u_n) v_n + \frac{h^{-(k+1)}}{\ell_{k+1} - \ell_{k+1}^*} (\mathring{u}_{n+k} - u_{n+k}) \Big\}, \quad n = 0, 1, 2, \ldots, N-k. \tag{6.118}$$

Then, for n = 0, 1, ..., N, 

$$u_n - y(x_n) = C_{k,k}^* h^k v_n + O(h^{k+1}) \quad \text{as } h \to 0, \tag{6.119}$$

where $C_{k,k}^*$ is the (global) error constant for the corrector formula. 

Proof. The proof is the same as that of Theorem 5.8.1, once it has been shown, in 
place of Chap. 5, (5.117), that 

$$\frac{h^{-(k+1)}}{\ell_{k+1} - \ell_{k+1}^*} (\mathring{u}_{n+k} - u_{n+k}) = -y^{(k+1)}(x_n) + O(h). \tag{6.120}$$

We now proceed to establish (6.120). 
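By (6.64), we have 

$$\mathring{u}_{n+k} - u_{n+k} = \sum_{s=1}^{k-1} \alpha_s^* u_{n+s} - \sum_{s=0}^{k-1} \alpha_s u_{n+s}$$
$$\qquad + h \Big\{ \sum_{s=0}^{k-1} \beta_s f(x_{n+s}, u_{n+s}) - \beta_k^* f(x_{n+k}, \mathring{u}_{n+k}) - \sum_{s=1}^{k-1} \beta_s^* f(x_{n+s}, u_{n+s}) \Big\}.$$

The first two sums on the right cancel because of (6.117). (It is here where $\alpha_0 = 0$ is 
used.) In the expression between braces, we expand each f about the exact solution 
trajectory to obtain 

$$\{\cdots\} = \sum_{s=0}^{k-1} \beta_s \big[ f(x_{n+s}, y(x_{n+s})) + f_y(x_{n+s}, y(x_{n+s})) (u_{n+s} - y(x_{n+s})) + O(h^{2k}) \big]$$
$$\quad - \beta_k^* \big[ f(x_{n+k}, y(x_{n+k})) + f_y(x_{n+k}, y(x_{n+k})) (\mathring{u}_{n+k} - y(x_{n+k})) + O(h^{2k}) \big]$$
$$\quad - \sum_{s=1}^{k-1} \beta_s^* \big[ f(x_{n+s}, y(x_{n+s})) + f_y(x_{n+s}, y(x_{n+s})) (u_{n+s} - y(x_{n+s})) + O(h^{2k}) \big]$$
$$= \sum_{s=0}^{k-1} \beta_s y'(x_{n+s}) - \sum_{s=1}^{k} \beta_s^* y'(x_{n+s}) + \sum_{s=0}^{k-1} \beta_s f_y(x_{n+s}, y(x_{n+s})) (u_{n+s} - y(x_{n+s}))$$
$$\quad - \beta_k^* f_y(x_{n+k}, y(x_{n+k})) (\mathring{u}_{n+k} - y(x_{n+k})) - \sum_{s=1}^{k-1} \beta_s^* f_y(x_{n+s}, y(x_{n+s})) (u_{n+s} - y(x_{n+s})) + O(h^{k+1}), \tag{6.121}$$

since $O(h^{2k})$ is of $O(h^{k+1})$ when $k \ge 1$. 

Now from the definition of truncation error, we have for the predictor and 
corrector formulae 

$$\frac{1}{h} \Big[ y(x_{n+k}) + \sum_{s=0}^{k-1} \alpha_s y(x_{n+s}) \Big] - \sum_{s=0}^{k-1} \beta_s y'(x_{n+s}) = (T_h)_n,$$
$$\frac{1}{h} \Big[ y(x_{n+k}) + \sum_{s=1}^{k-1} \alpha_s^* y(x_{n+s}) \Big] - \sum_{s=1}^{k} \beta_s^* y'(x_{n+s}) = (T_h^*)_n.$$

Upon subtraction, and using (6.117), we find 

$$\sum_{s=0}^{k-1} \beta_s y'(x_{n+s}) - \sum_{s=1}^{k} \beta_s^* y'(x_{n+s}) = (T_h^*)_n - (T_h)_n = (\ell_{k+1}^* - \ell_{k+1}) h^k y^{(k+1)}(x_n) + O(h^{k+1}).$$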


It suffices to show that the remaining terms in (6.121) together are $O(h^{k+1})$. This 
we do by expanding $f_y$ about the point $(x_n, y(x_n))$ and by using, from (6.107) 
and (6.108), 

$$u_{n+s} - y(x_{n+s}) = C_{k,k}^* h^k e(x_n) + O(h^{k+1})$$

as well as 

$$\mathring{u}_{n+k} - y(x_{n+k}) = - \sum_{s=0}^{k-1} \alpha_s (u_{n+s} - y(x_{n+s})) + h \sum_{s=0}^{k-1} \beta_s [f(x_{n+s}, u_{n+s}) - f(x_{n+s}, y(x_{n+s}))] - h (T_h)_n$$
$$= - \sum_{s=0}^{k-1} \alpha_s \big[ C_{k,k}^* h^k e(x_{n+s}) + O(h^{k+1}) \big] + O(h^{k+1})$$
$$= - \Big( \sum_{s=0}^{k-1} \alpha_s \Big) C_{k,k}^* h^k e(x_n) + O(h^{k+1}) = C_{k,k}^* h^k e(x_n) + O(h^{k+1}),$$

where in the last equation we have used that $\sum_{s=0}^{k} \alpha_s = 1 + \sum_{s=0}^{k-1} \alpha_s = 0$. We see 
that the terms in question add up to 

$$C_{k,k}^* h^k f_y(x_n, y(x_n)) e(x_n) \Big\{ \sum_{s=0}^{k-1} \beta_s - \beta_k^* - \sum_{s=1}^{k-1} \beta_s^* \Big\} + O(h^{k+1}).$$

Since $k \ge 1$, we have $Lt = L^* t = 0$ for the functionals L and L* associated with 
the predictor and corrector formula so that the expression in braces is 

$$\sum_{s=0}^{k-1} \beta_s - \sum_{s=1}^{k} \beta_s^* = \sum_{s=0}^{k} s \alpha_s - \sum_{s=1}^{k} s \alpha_s^* = \sum_{s=1}^{k} s (\alpha_s - \alpha_s^*) = 0,$$

again by (6.117). This completes the proof of Theorem 6.3.6. □ 

The formula (6.120) can also be used to estimate the local truncation error 
in connection with step control procedures such as those discussed for one-step 
methods in Chap. 5, Sect. 5.8.3. Changing the grid length in multistep methods, 
however, is more complicated than in one-step methods, and for this we refer to 
the specialized literature. 
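
The procedure of Theorem 6.3.6 can be sketched for k = 2. Everything specific in the sketch is an assumption made for illustration: the test problem $y' = -y$, $y(0) = 1$ on [0,1], the Adams pair consisting of the second-order Adams-Bashforth predictor and the trapezoidal (Adams-Moulton) corrector in PECE mode, and the constants $\ell_3 = 5/12$, $\ell_3^* = -1/12$, $C_{2,2}^* = -1/12$ for that pair. The sign convention in the driving term is the one for which the term approximates $-y^{(k+1)}$, the forcing in (6.106).

```python
import math

def pece_with_error_estimate(N):
    """AB2 predictor / trapezoidal corrector (PECE) on y' = -y, y(0) = 1,
    together with the global error estimate C* h^k v_n for k = 2.
    Returns (actual error at x = 1, estimated error at x = 1)."""
    h = 1.0 / N
    f = lambda x, y: -y
    fy = lambda x, y: -1.0                      # partial derivative of f in y
    ell, ell_star = 5.0 / 12.0, -1.0 / 12.0     # local error constants
    C_star = ell_star / 1.0                     # beta*(1) = 1 for the corrector
    x = [n * h for n in range(N + 1)]
    u = [1.0, math.exp(-h)]                     # exact starting values
    v = [0.0, 0.0]
    for n in range(N - 1):
        fn, fn1 = f(x[n], u[n]), f(x[n + 1], u[n + 1])
        u_pred = u[n + 1] + 0.5 * h * (3.0 * fn1 - fn)
        u_corr = u[n + 1] + 0.5 * h * (f(x[n + 2], u_pred) + fn1)
        # Milne-type estimate of -y'''(x_n)
        drive = (u_pred - u_corr) / (h**3 * (ell - ell_star))
        # Euler step for the variational equation
        v.append(v[n + 1] + h * (fy(x[n], u[n]) * v[n] + drive))
        u.append(u_corr)
    return u[N] - math.exp(-1.0), C_star * h**2 * v[N]

actual, estimate = pece_with_error_estimate(100)
print(actual, estimate)
```

For this problem both numbers are close to $-\tfrac{1}{12} h^2 e^{-1}$; the estimate is only first-order accurate in a relative sense, as (6.119) indicates.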


6.4 Analytic Theory of Order and Stability 


We now turn our attention to the following problems. 


(1) Construct a multistep formula of maximum algebraic degree, given its 
characteristic polynomial $\alpha(t)$. Normally, the latter is chosen to satisfy the root 
condition. 

(2) Determine the maximum algebraic degree among all k-step methods whose 
characteristic polynomials $\alpha(t)$ satisfy the root condition. 

Once we have solved problem (1), it is in principle straightforward to solve (2). 
We let $\alpha(t)$ vary over all polynomials of degree k satisfying the root condition and 
for each $\alpha$ construct the multistep formula of maximum degree. We then simply 
observe the maximum order so attainable. 

To deal with problem (1), it is useful to begin with an analytic characterization 
of algebraic degree (or order) of a multistep formula. 


6.4.1 Analytic Characterization of Order 


With the k-step method (6.2) we associate two polynomials, 

$$\alpha(t) = \sum_{s=0}^{k} \alpha_s t^s, \quad \beta(t) = \sum_{s=0}^{k} \beta_s t^s \quad (\alpha_k = 1), \tag{6.122}$$

the first being the characteristic polynomial already introduced in Sect. 6.3.1, (6.75). 

We define 

$$\delta(\zeta) = \frac{\alpha(\zeta)}{\ln \zeta} - \beta(\zeta), \quad \zeta \in \mathbb{C}, \tag{6.123}$$

which, since $\alpha(1) = 0$, is a function holomorphic in the disk $|\zeta - 1| < 1$. 

Theorem 6.4.1. The multistep method (6.2) has (exact) polynomial degree p if and 
only if $\delta(\zeta)$ has a zero of (exact) multiplicity p at $\zeta = 1$. 


Proof. In terms of the linear functional 

$$Lu = \sum_{s=0}^{k} [\alpha_s u(s) - \beta_s u'(s)], \tag{6.124}$$

method (6.2) has exact polynomial degree p if and only if (cf. Sect. 6.1.3, (6.35), 
where, without restriction of generality, we may assume scalar functions u) 

$$Lu = \frac{1}{p!} \int_0^k \lambda_p(\sigma) u^{(p+1)}(\sigma) \, d\sigma, \quad \frac{1}{p!} \int_0^k \lambda_p(\sigma) \, d\sigma = \ell_{p+1} \ne 0, \tag{6.125}$$

for every $u \in C^{p+1}[0,k]$. Choose $u(t) = e^{tz}$, where z is a complex parameter. Then 
(6.125) implies 

$$\sum_{s=0}^{k} [\alpha_s e^{sz} - \beta_s z e^{sz}] = \frac{z^{p+1}}{p!} \int_0^k \lambda_p(\sigma) e^{\sigma z} \, d\sigma,$$

that is, 

$$\frac{\alpha(e^z)}{z} - \beta(e^z) = \frac{z^p}{p!} \int_0^k \lambda_p(\sigma) e^{\sigma z} \, d\sigma. \tag{6.126}$$

Since the coefficient of $z^p$ on the right, when z = 0, equals $\ell_{p+1} \ne 0$, the function 
on the left (an entire function) has a zero of exact multiplicity p at z = 0. Thus, 
exact polynomial degree p of L implies that the function $\delta(e^z)$ has a zero of exact 
multiplicity p at z = 0. The converse is also true, since otherwise, (6.125) and 
(6.126) would hold with a different value of p. The theorem now follows readily by 
applying the conformal map 

$$\zeta = e^z, \quad z = \ln \zeta \tag{6.127}$$

and by observing that the multiplicity of a zero remains unchanged under such 
a map. □ 


Based on Theorem 6.4.1, the first problem mentioned now allows an easy 
solution. Suppose we are given the characteristic polynomial $\alpha(t)$ of degree k and 
we want to find $\beta(t)$ of degree $k'$ ($\le k$) such that the method (6.2) has maximum 
order. (Typically, $k' = k-1$ for an explicit method, and $k' = k$ for an implicit one.) 
We simply expand in (6.123) the first term of $\delta(\zeta)$ in a power series about $\zeta = 1$, 

$$\frac{\alpha(\zeta)}{\ln \zeta} = c_0 + c_1 (\zeta - 1) + c_2 (\zeta - 1)^2 + \cdots, \tag{6.128}$$

and then have no other choice for $\beta$ than to take 

$$\beta(\zeta) = c_0 + c_1 (\zeta - 1) + \cdots + c_{k'} (\zeta - 1)^{k'}. \tag{6.129}$$

In this way, $\delta(\zeta)$ has a zero of maximum multiplicity at $\zeta = 1$, given $\alpha(t)$ and the 
degree $k'$ of $\beta$. In fact, 

$$\delta(\zeta) = c_{k'+1} (\zeta - 1)^{k'+1} + \cdots,$$

and we get order $p \ge k' + 1$. The order could be larger than $k' + 1$ if by chance 
$c_{k'+1} = 0$. If p is the exact order so attained, then 

$$\delta(\zeta) = c_p (\zeta - 1)^p + \cdots, \quad c_p \ne 0, \quad p \ge k' + 1. \tag{6.130}$$
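
The expansion (6.128) is easy to carry out in exact rational arithmetic. The sketch below is one possible way to do it, not a procedure prescribed by the text: `expansion_coeffs` is a hypothetical helper that shifts $\alpha$ to powers of $x = \zeta - 1$, divides out the common factor x, and performs a power-series division by $\ln(1+x)/x = 1 - x/2 + x^2/3 - \cdots$. As a test case it uses $\alpha(\zeta) = (\zeta - 1)(\zeta - \lambda)$ with $\lambda = -1$, for which (6.129) with $k' = 2$ should reproduce Simpson's rule.

```python
from fractions import Fraction
from math import comb

def expansion_coeffs(alpha, n_terms):
    """Coefficients c_0, c_1, ... of alpha(zeta)/ln(zeta) in powers of (zeta - 1).

    alpha = [alpha_0, ..., alpha_k]; alpha(1) = 0 is required.
    """
    k = len(alpha) - 1
    # Taylor shift: alpha(1 + x) = sum_j a_j x^j with a_j = sum_s C(s, j) alpha_s
    a = [sum(Fraction(alpha[s]) * comb(s, j) for s in range(j, k + 1))
         for j in range(k + 1)]
    assert a[0] == 0, "alpha(1) must vanish"
    num = a[1:] + [Fraction(0)] * n_terms                       # alpha(1 + x) / x
    den = [Fraction((-1) ** j, j + 1) for j in range(n_terms)]  # ln(1 + x) / x
    c = []
    for j in range(n_terms):                                    # series division
        c.append(num[j] - sum(c[i] * den[j - i] for i in range(j)))
    return c

lam = Fraction(-1)                       # alpha(zeta) = (zeta - 1)(zeta - lam)
c = expansion_coeffs([lam, -(1 + lam), 1], 5)
# beta(zeta) = c_0 + c_1 (zeta - 1) + c_2 (zeta - 1)^2, rewritten in powers of zeta
beta = [c[0] - c[1] + c[2], c[1] - 2 * c[2], c[2]]
print(c)      # c_3 = 0 here, so the order exceeds k' + 1 = 3
print(beta)   # [beta_0, beta_1, beta_2]
```

With $\lambda = -1$ the result is $\beta = (1/3, 4/3, 1/3)$ and $c_3 = 0$, $c_4 = -1/90$, which is the "by chance" situation mentioned above in which the order is larger than $k' + 1$.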


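It is interesting to compare this with (6.126): 

$$\delta(\zeta) = \frac{(\ln \zeta)^p}{p!} \int_0^k \lambda_p(\sigma) \zeta^\sigma \, d\sigma$$
$$= \frac{1}{p!} \big[ (\zeta - 1) - \tfrac{1}{2} (\zeta - 1)^2 + \cdots \big]^p \int_0^k \lambda_p(\sigma) \Big[ 1 + \binom{\sigma}{1} (\zeta - 1) + \cdots \Big] d\sigma$$
$$= \Big( \frac{1}{p!} \int_0^k \lambda_p(\sigma) \, d\sigma \Big) (\zeta - 1)^p + \cdots$$
$$= \ell_{p+1} (\zeta - 1)^p + \cdots.$$

Thus, 

$$\ell_{p+1} = c_p. \tag{6.131}$$

Similarly, if the method is stable, then 

$$C_{k,p} = \frac{\ell_{p+1}}{\sum_{s=0}^{k} \beta_s} = \frac{c_p}{\beta(1)} = \frac{c_p}{c_0}. \tag{6.132}$$

We see that the local and global error constants can be found directly from expansion 
(6.128). It must be observed, however, that (6.131) and (6.132) hold only for the 
k-step methods of maximum degree. If the degree p is not maximal, then 

$$\ell_{p+1} = d_p \tag{6.133}$$

and 

$$C_{k,p} = \frac{d_p}{c_0}, \tag{6.134}$$

where 

$$\delta(\zeta) = d_p (\zeta - 1)^p + \cdots, \quad d_p \ne 0. \tag{6.135}$$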


It seems appropriate, at this point, to observe that if $\alpha(\zeta)$ and $\beta(\zeta)$ have a 
common factor $\omega(\zeta)$, 

$$\alpha(\zeta) = \omega(\zeta) \alpha_0(\zeta), \quad \beta(\zeta) = \omega(\zeta) \beta_0(\zeta),$$

and $\omega(1) \ne 0$, then 

$$\delta(\zeta) = \omega(\zeta) \delta_0(\zeta), \quad \delta_0(\zeta) = \frac{\alpha_0(\zeta)}{\ln \zeta} - \beta_0(\zeta),$$

and $\delta_0(\zeta)$ vanishes at $\zeta = 1$ with the same order as $\delta(\zeta)$. The multistep method 
$\{\alpha, \beta\}$ and the "reduced" multistep method $\{\alpha_0, \beta_0\}$ therefore have the same order 
and indeed the same error constants (6.133) and (6.134). On the other hand, 
$\omega(1) = 0$ would imply $\beta(1) = 0$, and the method $\{\alpha, \beta\}$ would not be stable 
(cf. Sect. 6.3.4, (6.109)). Since, on top of that, a solution of the difference equation 
(6.2) corresponding to $\{\alpha_0, \beta_0\}$ is also a solution of (6.2) for $\{\alpha, \beta\}$ (cf. Ex. 8), 
it would be pointless to consider the method $\{\alpha, \beta\}$. For these reasons it is no 
restriction to assume that the polynomials $\alpha(\zeta)$, $\beta(\zeta)$ are irreducible. 


Example. Construct all stable implicit two-step methods of maximum order. 
Here, 

$$\alpha(\zeta) = (\zeta - 1)(\zeta - \lambda), \quad -1 \le \lambda < 1,$$

since 1 is always a zero of $\alpha(\zeta)$ and the second zero $\lambda$ must satisfy the root condition. 
For the expansion (6.128) we have 

$$\frac{\alpha(\zeta)}{\ln \zeta} = \frac{(\zeta - 1)^2 + (1 - \lambda)(\zeta - 1)}{(\zeta - 1) - \frac{1}{2} (\zeta - 1)^2 + \frac{1}{3} (\zeta - 1)^3 - \cdots} = \frac{1 - \lambda + (\zeta - 1)}{1 - \frac{1}{2} (\zeta - 1) + \frac{1}{3} (\zeta - 1)^2 - \cdots}$$
$$= c_0 + c_1 (\zeta - 1) + c_2 (\zeta - 1)^2 + \cdots.$$

An easy calculation gives 

$$c_0 = 1 - \lambda, \quad c_1 = \tfrac{1}{2} (3 - \lambda), \quad c_2 = \tfrac{1}{12} (5 + \lambda),$$
$$c_3 = -\tfrac{1}{24} (1 + \lambda), \quad c_4 = \tfrac{1}{720} (11 + 19 \lambda).$$

Thus, 

$$\beta(\zeta) = c_0 + c_1 (\zeta - 1) + c_2 (\zeta - 1)^2 = \frac{5 + \lambda}{12} \zeta^2 + \frac{2 - 2\lambda}{3} \zeta - \frac{1 + 5\lambda}{12},$$

giving the desired method 

$$u_{n+2} - (1 + \lambda) u_{n+1} + \lambda u_n = h \Big\{ \frac{5 + \lambda}{12} f_{n+2} + \frac{2 - 2\lambda}{3} f_{n+1} - \frac{1 + 5\lambda}{12} f_n \Big\}.$$

If $c_3 \ne 0$ (i.e., $\lambda \ne -1$), the order is exactly p = 3, and the error constant is 

$$C_{2,3} = \frac{c_3}{c_0} = -\frac{1}{24} \, \frac{1 + \lambda}{1 - \lambda}.$$

The case $\lambda = -1$ is exceptional, giving exact order p = 4 (since $c_4 = -\frac{1}{90} \ne 0$); 
in fact, 

$$u_{n+2} - u_n = \frac{h}{3} (f_{n+2} + 4 f_{n+1} + f_n) \quad (\lambda = -1)$$

is precisely Simpson's rule. This is an example of an "optimal" method, a stable 
k-step method of order k + 2 for k even (cf. Sect. 6.4.2, Theorem 6.4.2(b)). 

Example. Construct a pair of two-step methods, one explicit, the other implicit, both 
having $\alpha(\zeta) = \zeta^2 - \zeta$ and order p = 2, but global error constants that are equal in 
modulus and opposite in sign. 

Let $\beta(\zeta)$ and $\beta^*(\zeta)$ be the $\beta$-polynomials for the explicit and implicit formula, 
respectively, and $C_{2,2}$, $C_{2,2}^*$ the corresponding error constants. We have 

$$\frac{\alpha(\zeta)}{\ln \zeta} = \frac{\zeta (\zeta - 1)}{(\zeta - 1) - \frac{1}{2} (\zeta - 1)^2 + \cdots} = \frac{1 + (\zeta - 1)}{1 - \frac{1}{2} (\zeta - 1) + \frac{1}{3} (\zeta - 1)^2 - \cdots}$$
$$= 1 + \tfrac{3}{2} (\zeta - 1) + \tfrac{5}{12} (\zeta - 1)^2 + \cdots.$$

Thus, 

$$\beta(\zeta) = 1 + \tfrac{3}{2} (\zeta - 1) = \tfrac{3}{2} \zeta - \tfrac{1}{2},$$

giving 

$$C_{2,2} = \frac{c_2}{c_0} = \frac{5}{12}.$$

For $\beta^*$, we try 

$$\beta^*(\zeta) = 1 + \tfrac{3}{2} (\zeta - 1) + b (\zeta - 1)^2.$$

As we are not aiming for optimal degree, we must use (6.134) and (6.135) and find 

$$C_{2,2}^* = \frac{\frac{5}{12} - b}{1} = \frac{5}{12} - b.$$

Since we want $C_{2,2}^* = -C_{2,2}$, we get 

$$\frac{5}{12} - b = -\frac{5}{12}, \quad b = \frac{5}{6},$$

and so, 

$$\beta^*(\zeta) = 1 + \tfrac{3}{2} (\zeta - 1) + \tfrac{5}{6} (\zeta - 1)^2 = \tfrac{5}{6} \zeta^2 - \tfrac{1}{6} \zeta + \tfrac{1}{3}.$$

The desired pair of methods, therefore, is 

$$u_{n+2} = u_{n+1} + \frac{h}{2} (3 f_{n+1} - f_n),$$
$$u_{n+2}^* = u_{n+1}^* + \frac{h}{6} (5 f_{n+2}^* - f_{n+1}^* + 2 f_n^*). \tag{6.136}$$

The interest in such pairs of formulae is rather evident: if both formulae are used 
independently (i.e., not in a predictor-corrector mode, but the corrector formula 
being iterated to convergence), then by Theorem 6.3.5, (6.107), we have 

$$u_n - y(x_n) = C_{2,2} h^2 e(x_n) + O(h^3),$$
$$u_n^* - y(x_n) = -C_{2,2} h^2 e(x_n) + O(h^3); \tag{6.137}$$

that is, asymptotically for $h \to 0$, the exact solution is halfway between $u_n$ and 
$u_n^*$. This generates upper and lower bounds for each solution component and built-in 
error bounds $\frac{1}{2} |u_n^* - u_n|$ (absolute value taken componentwise). A break-down 
occurs in the ith component if $e^i(x_n) \approx 0$; if $e^i(x)$ changes sign across $x_n$, the 
bounds switch from an upper to a lower one and vice versa. 

Naturally, it would not be difficult to generate such "equilibrated" pairs of 
formulae having orders much larger than 2 (cf. Ex. 9). 
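
The bracketing property (6.137) can be observed on a simple problem. The sketch below is illustrative only: the test problem $y' = -y$, $y(0) = 1$ on [0,1], the exact starting values, and the name `run_pair` are assumptions made here. Because f is linear, the implicit formula of (6.136) is solved in closed form at each step, which stands in for iterating the corrector to convergence.

```python
import math

def run_pair(N):
    """Apply both formulas of (6.136) to y' = -y, y(0) = 1 on [0, 1].
    Returns (explicit result, implicit result) at x = 1."""
    h = 1.0 / N
    u = [1.0, math.exp(-h)]      # explicit formula, exact starting values
    w = [1.0, math.exp(-h)]      # implicit formula (u* in the text)
    for n in range(N - 1):
        u.append(u[n + 1] + 0.5 * h * (3.0 * (-u[n + 1]) - (-u[n])))
        # w_{n+2} = w_{n+1} + (h/6)(5 f_{n+2} - f_{n+1} + 2 f_n) with f = -w,
        # solved exactly for w_{n+2}
        rhs = w[n + 1] + (h / 6.0) * (-(-w[n + 1]) + 2.0 * (-w[n]))
        w.append(rhs / (1.0 + 5.0 * h / 6.0))
    return u[N], w[N]

u_N, w_N = run_pair(100)
y_N = math.exp(-1.0)
print(w_N, y_N, u_N)                             # exact value lies in between
print(0.5 * abs(u_N - w_N), abs(u_N - y_N))      # built-in bound vs. actual error
```

Here $e(x) = x e^{-x} > 0$, so the explicit result lies above and the implicit result below the exact solution.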


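Example. Given $\alpha(t)$, construct an explicit $k$-step method of maximum order in
Newton’s form (involving backward differences).
Here we want $L$ in the form

$$Lu = \sum_{s=0}^{k} \alpha_s u(s) - \sum_{s=0}^{k-1} \gamma_s \nabla^s u'(k-1).$$

The mapping $\zeta = e^z$ used previously in (6.127) is no longer appropriate here,
since we do not want the coefficients of $\beta(\zeta)$. The principle used in the proof of
Theorem 6.4.1, however, remains the same: we want $\frac1z L_{(t)} e^{tz}$ to vanish at $z = 0$
with multiplicity as large as possible. Now

$$\frac1z L_{(t)} e^{tz} = \frac1z \sum_{s=0}^{k} \alpha_s e^{sz} - \sum_{s=0}^{k-1} \gamma_s \left[\nabla^s_{(t)} e^{tz}\right]_{t=k-1}$$

and

$$\nabla_{(t)} e^{tz} = e^{tz} - e^{(t-1)z} = e^{tz}\,\frac{e^z-1}{e^z}, \quad \ldots, \quad \nabla^s_{(t)} e^{tz} = e^{tz}\left(\frac{e^z-1}{e^z}\right)^s.$$

Therefore,

$$\frac1z L_{(t)} e^{tz} = \frac{\alpha(e^z)}{z} - e^{(k-1)z} \sum_{s=0}^{k-1} \gamma_s \left(\frac{e^z-1}{e^z}\right)^s. \tag{6.138}$$

This suggests the mapping

$$\zeta = \frac{e^z-1}{e^z},$$

which maps a neighborhood of $z = 0$ conformally onto a neighborhood of $\zeta = 0$.
Thus, (6.138) has a zero at $z = 0$ of maximal multiplicity if and only if

$$\frac{\alpha\!\left(\frac{1}{1-\zeta}\right)}{-\ln(1-\zeta)} - \frac{1}{(1-\zeta)^{k-1}} \sum_{s=0}^{k-1} \gamma_s \zeta^s
= \frac{1}{(1-\zeta)^{k-1}} \left\{ -\frac{(1-\zeta)^{k-1}\,\alpha\!\left(\frac{1}{1-\zeta}\right)}{\ln(1-\zeta)} - \sum_{s=0}^{k-1} \gamma_s \zeta^s \right\}$$

has a zero at $\zeta = 0$ of maximal multiplicity. Thus, we have to expand

$$-\frac{(1-\zeta)^{k-1}\,\alpha\!\left(\frac{1}{1-\zeta}\right)}{\ln(1-\zeta)} = \sum_{s=0}^{\infty} \gamma_{ks} \zeta^s \tag{6.139}$$

and take

$$\gamma_s = \gamma_{ks}, \quad s = 0, 1, \ldots, k-1. \tag{6.140}$$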


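To illustrate, for Adams-Bashforth methods we have

$$\alpha(t) = t^k - t^{k-1},$$

hence

$$-\frac{(1-\zeta)^{k-1}\,\alpha\!\left(\frac{1}{1-\zeta}\right)}{\ln(1-\zeta)} = -\frac{\zeta}{(1-\zeta)\ln(1-\zeta)} = \gamma(\zeta),$$

which is the generating function

$$\gamma(\zeta) = \sum_{s=0}^{\infty} \gamma_s \zeta^s$$

obtained earlier in Sect. 6.2.1, (6.54). We now see more clearly why the coefficients
$\gamma_s$ are independent of $k$.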
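As a small illustration (ours, not the author's), the Adams-Bashforth coefficients can be generated from this generating function without a computer algebra system. The sketch below assumes only the identity $\gamma(\zeta)\cdot\bigl(-\ln(1-\zeta)/\zeta\bigr) = 1/(1-\zeta)$, which follows from the closed form above; the function name is hypothetical.

```python
from fractions import Fraction as F

def adams_bashforth_gammas(n):
    """First n coefficients of gamma(zeta) = -zeta / ((1 - zeta) ln(1 - zeta)).

    Multiplying by -ln(1 - zeta)/zeta = sum_m zeta^m/(m+1) gives 1/(1 - zeta),
    i.e. sum_{m=0}^{s} gamma_{s-m}/(m+1) = 1 for every s >= 0.
    """
    g = []
    for s in range(n):
        g.append(F(1) - sum(g[s - m] / (m + 1) for m in range(1, s + 1)))
    return g

print([str(c) for c in adams_bashforth_gammas(5)])
# ['1', '1/2', '5/12', '3/8', '251/720']
```

The recurrence never refers to $k$, which is another way of seeing that the $\gamma_s$ are independent of $k$.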


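Example. Given $\alpha(t)$, construct an implicit $k$-step method of maximum order in
Newton’s form.
Here,

$$Lu = \sum_{s=0}^{k} \left[\alpha_s u(s) - \gamma^*_s \nabla^s u'(k)\right],$$

and a calculation similar to the one in the previous Example yields

$$\gamma^*_s = \gamma^*_{ks}, \quad s = 0, 1, \ldots, k, \tag{6.141}$$

where

$$-\frac{(1-\zeta)^{k}\,\alpha\!\left(\frac{1}{1-\zeta}\right)}{\ln(1-\zeta)} = \sum_{s=0}^{\infty} \gamma^*_{ks} \zeta^s. \tag{6.142}$$

Again, for Adams-Moulton methods,

$$-\frac{(1-\zeta)^{k}\,\alpha\!\left(\frac{1}{1-\zeta}\right)}{\ln(1-\zeta)} = -\frac{\zeta}{\ln(1-\zeta)} = \gamma^*(\zeta),$$

with

$$\gamma^*(\zeta) = \sum_{s=0}^{\infty} \gamma^*_s \zeta^s$$

the generating function found earlier.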
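The implicit coefficients can be produced in the same way. The following sketch (again ours, with a hypothetical function name) uses only $\gamma^*(\zeta)\cdot\bigl(-\ln(1-\zeta)/\zeta\bigr) = 1$ and then checks the relation $\gamma^*(\zeta) = (1-\zeta)\gamma(\zeta)$ between the two generating functions through partial sums.

```python
from fractions import Fraction as F

def adams_moulton_gammas(n):
    """First n coefficients of gamma*(zeta) = -zeta / ln(1 - zeta).

    Since gamma*(zeta) * sum_m zeta^m/(m+1) = 1, the coefficients satisfy
    gamma*_0 = 1 and sum_{m=0}^{s} gamma*_{s-m}/(m+1) = 0 for s >= 1.
    """
    g = [F(1)]
    for s in range(1, n):
        g.append(-sum(g[s - m] / (m + 1) for m in range(1, s + 1)))
    return g[:n]

gs = adams_moulton_gammas(5)
print([str(c) for c in gs])
# ['1', '-1/2', '-1/12', '-1/24', '-19/720']

# gamma*(zeta) = (1 - zeta) * gamma(zeta), so partial sums of the gamma*_s
# reproduce the Adams-Bashforth coefficients gamma_s.
partial_sums = [sum(gs[: s + 1]) for s in range(5)]
print([str(c) for c in partial_sums])
# ['1', '1/2', '5/12', '3/8', '251/720']
```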


Example. Given $\beta(t)$, construct a $k$-step method of maximum order in Newton’s
form.

In all previous examples we were given $\alpha(t)$ and could thus choose it to satisfy
the root condition. There is no intrinsic reason why we should not start with $\beta(t)$
and determine $\alpha(t)$ so as to maximize the order. It then needs to be checked, of
course, whether the $\alpha(t)$ thus found satisfies the root condition. Thus, in the example
at hand,

$$Lu = \sum_{s=0}^{k} \delta_s \nabla^s u(k) - \sum_{s=0}^{k} \beta_s u'(s). \tag{6.143}$$

Following the procedure used previously, we get

$$\frac1z L_{(t)} e^{tz} = \frac1z\, e^{kz} \sum_{s=0}^{k} \delta_s \left(\frac{e^z-1}{e^z}\right)^s - \beta(e^z)
= -\frac{1}{(1-\zeta)^k \ln(1-\zeta)} \sum_{s=0}^{k} \delta_s \zeta^s - \beta\!\left(\frac{1}{1-\zeta}\right),$$

and we want this to vanish at $\zeta = 0$ with maximum order. Clearly, $\delta_0 = 0$, and if

$$-(1-\zeta)^k \ln(1-\zeta)\,\beta\!\left(\frac{1}{1-\zeta}\right) = \sum_{s=1}^{\infty} d_s \zeta^s, \tag{6.144}$$

we must take the remaining coefficients to be

$$\delta_s = d_s, \quad s = 1, 2, \ldots, k, \tag{6.145}$$

to achieve order $p \ge k$.
A particularly simple example obtains if $\beta(t) = t^k$, in which case

$$-(1-\zeta)^k \ln(1-\zeta)\,\beta\!\left(\frac{1}{1-\zeta}\right) = -\ln(1-\zeta) = \zeta + \frac12\zeta^2 + \frac13\zeta^3 + \cdots,$$

so that

$$\delta_0 = 0, \quad \delta_s = \frac1s, \quad s = 1, 2, \ldots, k. \tag{6.146}$$

The method

$$\nabla u_{n+k} + \frac12 \nabla^2 u_{n+k} + \cdots + \frac1k \nabla^k u_{n+k} = h f_{n+k} \tag{6.147}$$

of order $k$ so obtained is called the backward differentiation method and is of
some interest in connection with stiff problems (cf. Sect. 6.5.2). Its characteristic
polynomial $\alpha(t)$ is easily obtained from (6.143) by noting that
$\nabla^s u(k) = \sum_{r=0}^{s} (-1)^r \binom{s}{r} u(k-r)$. One finds

$$\alpha(t) = \sum_{s=0}^{k} \alpha_s t^s, \qquad \alpha_k = \sum_{r=1}^{k} \frac1r, \qquad
\alpha_s = (-1)^{k-s} \sum_{r=0}^{s} \frac{1}{k-s+r} \binom{k-s+r}{k-s}, \quad s = 0, 1, \ldots, k-1. \tag{6.148}$$

Note here that $\alpha_k \ne 1$, but we can normalize (6.143) by dividing both sides by $\alpha_k$.
It turns out that $\alpha(t)$ in (6.148) satisfies the root condition for $k = 1, 2, \ldots, 6$ but
not for $k = 7$, in which case one root of $\alpha$ lies outside the unit circle (cf. Ex. 10(a)).

We remark that computer algebra systems such as Maple are very useful to 
implement series expansions of the type considered in this section; see, e.g., MA 3. 
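For readers without access to a computer algebra system, the claims about (6.148) can also be checked in exact rational arithmetic. The sketch below is an illustration of ours, not the author's code: it evaluates (6.148), removes the zero at $t = 1$ by synthetic division, and applies the classical Schur-Cohn reduction (a technique we bring in here; it is not discussed in the text) to test whether the remaining zeros lie strictly inside the unit disk. All function names are hypothetical.

```python
from fractions import Fraction as F
from math import comb

def bdf_alpha(k):
    """Coefficients alpha_0..alpha_k of the BDF polynomial, following (6.148)."""
    alpha = []
    for s in range(k):
        total = sum(F(comb(k - s + r, k - s), k - s + r) for r in range(s + 1))
        alpha.append((-1) ** (k - s) * total)
    alpha.append(sum(F(1, r) for r in range(1, k + 1)))
    return alpha

def deflate_root_at_one(coeffs):
    """Divide a polynomial with p(1) = 0 by (t - 1); coefficients are low-to-high."""
    q, carry = [], F(0)
    for c in reversed(coeffs[1:]):      # synthetic division, highest power first
        carry += c
        q.append(carry)
    return list(reversed(q))

def all_roots_inside_unit_disk(coeffs):
    """Schur-Cohn recursion (exact arithmetic) for real coefficients, low-to-high."""
    c = list(coeffs)
    while len(c) > 1:
        if abs(c[0]) >= abs(c[-1]):
            return False
        n = len(c) - 1
        # coefficients of (c_n p(t) - c_0 p*(t)) / t, with p* the reversed polynomial
        c = [c[-1] * c[j] - c[0] * c[n - j] for j in range(1, n + 1)]
    return c[0] != 0

def satisfies_root_condition(k):
    """True if alpha(t) has the simple zero t = 1 and all other zeros in |t| < 1."""
    alpha = bdf_alpha(k)
    assert sum(alpha) == 0              # t = 1 is always a zero
    return all_roots_inside_unit_disk(deflate_root_at_one(alpha))

print([str(a) for a in bdf_alpha(3)])   # ['-1/3', '3/2', '-3', '11/6']
print([satisfies_root_condition(k) for k in range(1, 8)])
```

The test used here is slightly stronger than the root condition (it requires all zeros other than $t = 1$ to lie strictly inside the unit circle), which is sufficient for the values of $k$ considered.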


6.4.2 Stable Methods of Maximum Order 


We now give an answer to problem (2) stated at the beginning of Sect. 6.4. 


Theorem 6.4.2. (a) If $k$ is odd, then every stable $k$-step method has order $p \le k + 1$.

(b) If $k$ is even, then every stable $k$-step method has order $p \le k + 2$, the order
being $k + 2$ if and only if $\alpha(t)$ has all its zeros on the circumference of the unit
circle.


Before we prove this theorem, we make the following remarks. 


1. In case (a), we can attain order $p = k + 1$ for any given $\alpha(t)$ satisfying the root
condition; cf. Sect. 6.4.1, (6.130) with $k' = k$.

2. Since $\alpha(t)$ is a real polynomial, all complex zeros of $\alpha$ occur in conjugate pairs.
It follows from part (b) of Theorem 6.4.2 that $p = k + 2$ if and only if $\alpha(t)$ has
zeros at $t = 1$ and $t = -1$, and all other zeros (if any) are located on $|t| = 1$ in
conjugate complex pairs.

3. The maximum order among all $k$-step methods (stable or not) is known to be
$p = 2k$. The stability requirement thus reduces this maximum possible order to
roughly one-half.


Proof of Theorem 6.4.2. We want to determine $\alpha(t)$, subject to the root condition,
such that

$$\delta(\zeta) = \frac{\alpha(\zeta)}{\ln \zeta} - \beta(\zeta) \tag{6.149}$$

has a zero at $\zeta = 1$ of maximum multiplicity (cf. Theorem 6.4.1). We map the unit
disk $|\zeta| < 1$ conformally onto the left half-plane $\operatorname{Re} z < 0$ (which is easier to deal
with) by means of

$$\zeta = \frac{1+z}{1-z}, \qquad z = \frac{\zeta-1}{\zeta+1}.$$

This maps the point $\zeta = 1$ to the origin $z = 0$ and preserves multiplicities of zeros
at these two points. The function $\delta(\zeta)$ in (6.149) is transformed to

$$d(z) = \left(\frac{1-z}{2}\right)^k \delta\!\left(\frac{1+z}{1-z}\right), \tag{6.150}$$

except for the factor $[(1-z)/2]^k$ which, however, does not vanish at $z = 0$. We write

$$d(z) = \frac{a(z)}{\ln\dfrac{1+z}{1-z}} - b(z), \tag{6.151}$$

where

$$a(z) = \left(\frac{1-z}{2}\right)^k \alpha\!\left(\frac{1+z}{1-z}\right), \qquad
b(z) = \left(\frac{1-z}{2}\right)^k \beta\!\left(\frac{1+z}{1-z}\right) \tag{6.152}$$

are both polynomials of degree $\le k$.
Our problem thus reduces to the following purely analytical problem: How many
initial terms of the Maclaurin expansion of $d(z)$ in (6.151) can be made to vanish if
$a(z)$ is to have all its zeros in $\operatorname{Re} z \le 0$, and those with $\operatorname{Re} z = 0$ are to be simple?
To solve this problem, we need some preliminary facts:

(i) We have $a(1) = 1$ and $a(0) = 0$. This follows trivially from (6.152), since
$\alpha(t)$ has leading coefficient 1 and $\alpha(1) = 0$.

(ii) The polynomial $a(z)$ has exact degree $k$, unless $\zeta = -1$ is a zero of $\alpha(\zeta)$,
in which case $a(z)$ has degree $k - \mu$, where $\mu$ is the multiplicity of the zero
$\zeta = -1$. (If the root condition is satisfied, then, of course, $\mu = 1$.) This also
follows straightforwardly from (6.152).

(iii) Let

$$a(z) = a_1 z + a_2 z^2 + \cdots + a_k z^k,$$

where $a_1 \ne 0$ by the root condition. If $a(z)$ has all its zeros in $\operatorname{Re} z \le 0$, then
$a_s \ge 0$ for $s = 1, 2, \ldots, k$. (The converse is not true.) This is easily seen if we
factor $a(z)$ with respect to its real and complex zeros:

$$a(z) = a_\ell\, z \prod_{\rho} (z - r_\rho) \prod_{\sigma} [z - (x_\sigma + i y_\sigma)][z - (x_\sigma - i y_\sigma)], \qquad a_\ell \ne 0.$$

Simplifying, we can write

$$a(z) = a_\ell\, z \prod_{\rho} (z - r_\rho) \prod_{\sigma} [(z - x_\sigma)^2 + y_\sigma^2].$$

Since by assumption, $r_\rho \le 0$, $x_\sigma \le 0$, all nonzero coefficients of $a(z)$ have the
sign of $a_\ell$, and $a_\ell > 0$ since $a(1) = 1$.
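(iv) Let

$$\frac{z}{\ln\dfrac{1+z}{1-z}} = \lambda_0 + \lambda_2 z^2 + \lambda_4 z^4 + \cdots.$$

Then

$$\lambda_0 = \tfrac12, \qquad \lambda_{2\nu} < 0 \quad \text{for } \nu = 1, 2, 3, \ldots.$$

The proof of this is deferred to the end of this section.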
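Property (iv) is easy to explore numerically before it is proved. The following sketch is ours; it relies only on the elementary expansion $\ln\frac{1+z}{1-z} = 2\sum_{m\ge0} z^{2m+1}/(2m+1)$ and inverts the resulting power series in exact arithmetic (the function name is hypothetical).

```python
from fractions import Fraction as F

def lambda_coefficients(n):
    """Return [lambda_0, lambda_2, ..., lambda_{2(n-1)}] for z / ln((1+z)/(1-z)).

    With w = z^2, ln((1+z)/(1-z)) / z = 2 * sum_m w^m / (2m+1); inverting this
    power series gives lambda_0 = 1/2 and, for v >= 1,
    lambda_{2v} = -sum_{m=1}^{v} lambda_{2(v-m)} / (2m+1).
    """
    lam = [F(1, 2)]
    for v in range(1, n):
        lam.append(-sum(lam[v - m] / (2 * m + 1) for m in range(1, v + 1)))
    return lam[:n]

lam = lambda_coefficients(30)
print([str(c) for c in lam[:4]])        # ['1/2', '-1/6', '-2/45', '-22/945']
print(all(c < 0 for c in lam[1:]))      # True
```

The first thirty coefficients beyond $\lambda_0$ are indeed negative; this is, of course, only evidence and not a substitute for the proof given at the end of the section.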


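Suppose now that

$$\frac{a(z)}{\ln\dfrac{1+z}{1-z}} = b_0 + b_1 z + b_2 z^2 + \cdots. \tag{6.153}$$

For maximum order $p$, we must take

$$b(z) = b_0 + b_1 z + \cdots + b_k z^k \tag{6.154}$$

in (6.151). The problem at hand then amounts to determining how many of the
coefficients $b_{k+1}, b_{k+2}, \ldots$ in (6.153) can vanish simultaneously if $a(z)$ is restricted
to have all its zeros in $\operatorname{Re} z \le 0$ and only simple zeros on $\operatorname{Re} z = 0$.

Expansion (6.153), in view of (iv), is equivalent to

$$(\lambda_0 + \lambda_2 z^2 + \lambda_4 z^4 + \cdots)(a_1 + a_2 z + a_3 z^2 + \cdots + a_k z^{k-1}) = b_0 + b_1 z + b_2 z^2 + \cdots.$$

Comparing coefficients of like power on the right and left, we get

$$\begin{aligned}
b_0 &= \lambda_0 a_1,\\
b_1 &= \lambda_0 a_2,\\
&\;\;\vdots\\
b_{2\nu} &= \lambda_0 a_{2\nu+1} + \lambda_2 a_{2\nu-1} + \cdots + \lambda_{2\nu} a_1,\\
b_{2\nu+1} &= \lambda_0 a_{2\nu+2} + \lambda_2 a_{2\nu} + \cdots + \lambda_{2\nu} a_2,
\end{aligned} \qquad \nu = 1, 2, 3, \ldots, \tag{6.155}$$

where for convenience we assume $a_\mu = 0$ if $\mu > k$. We distinguish two cases.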


Case 1. $k$ is odd. Then by (6.155),

$$b_{k+1} = \lambda_0 a_{k+2} + \lambda_2 a_k + \lambda_4 a_{k-2} + \cdots + \lambda_{k+1} a_1.$$

Since $a_{k+2} = 0$ by convention, $\lambda_{2\nu} < 0$ by (iv), and $a_s \ge 0$, $a_1 > 0$ by (iii), it
follows that $b_{k+1} < 0$. Thus, $p = k + 1$ is the maximum possible order.

Case 2. $k$ is even. Here (6.155) gives

$$b_{k+1} = \lambda_0 a_{k+2} + \lambda_2 a_k + \lambda_4 a_{k-2} + \cdots + \lambda_k a_2.$$

Again, $a_{k+2} = 0$, $\lambda_{2\nu} < 0$, and $a_s \ge 0$, so that $b_{k+1} \le 0$. We have $b_{k+1} = 0$ if and
only if $a_2 = a_4 = \cdots = a_k = 0$, that is,

$$a(-z) = -a(z). \tag{6.156}$$

Since $a_k = 0$, we conclude from (ii) that $\alpha(\zeta)$ has a zero at $\zeta = -1$. A trivial
zero is $\zeta = 1$. In view of (6.156), the polynomial $a(z)$ cannot have zeros off the
imaginary axis without violating the root condition. Therefore, $a(z)$ has all its zeros
on $\operatorname{Re} z = 0$, hence $\alpha(\zeta)$ all its zeros on $|\zeta| = 1$. Conversely, if $\alpha(\zeta)$ has zeros at
$\zeta = \pm 1$ and all other zeros on $|\zeta| = 1$, then $a_k = 0$, and

$$a(z) = a_{k-1}\, z \prod_{\nu} [(z - i y_\nu)(z + i y_\nu)] = a_{k-1}\, z \prod_{\nu} (z^2 + y_\nu^2);$$

that is, $a(z)$ is an odd polynomial and, therefore, $b_{k+1} = 0$.

This proves the second half of part (b) of Theorem 6.4.2. To complete the proof,
we have to show that $b_{k+2} = 0$ is impossible. This follows again from (6.155), if
we note that (for $k$ even)

$$b_{k+2} = \lambda_0 a_{k+3} + \lambda_2 a_{k+1} + \lambda_4 a_{k-1} + \cdots + \lambda_{k+2} a_1
= \lambda_4 a_{k-1} + \cdots + \lambda_{k+2} a_1 < 0$$

for the same reason as in Case 1. □
It remains to prove the crucial property (iv). Let

$$f(z) := \frac{z}{\ln\dfrac{1+z}{1-z}} = \lambda_0 + \lambda_2 z^2 + \lambda_4 z^4 + \cdots. \tag{6.157}$$

The fact that $\lambda_0 = \frac12$ follows easily by taking the limit of $f(z)$ as $z \to 0$. By
Cauchy’s formula,

$$\lambda_{2\nu} = \frac{1}{2\pi i} \oint_C \frac{f(z)\,dz}{z^{2\nu+1}} = \frac{1}{2\pi i} \oint_C \frac{dz}{z^{2\nu} \ln\dfrac{1+z}{1-z}}, \qquad \nu \ge 1,$$

where $C$ is a contour encircling the origin in the positive sense of direction anywhere
in the complex plane cut along $(-\infty, -1)$ and $(1, \infty)$ (where $f$ in (6.157) is one-valued
analytic). To get a negativity result for the $\lambda_{2\nu}$ one would like to push the
contour as close to the cuts as possible. This is much easier to do after a change of
variables according to

$$u = \frac1z.$$

Then

$$\lambda_{2\nu} = -\frac{1}{2\pi i} \oint_\Gamma \frac{u^{2\nu-2}}{\ln\dfrac{u+1}{u-1}}\,du, \tag{6.158}$$

where the cut in the $u$-plane now runs from $-1$ to $1$, and $\Gamma$ is a contour encircling
this cut in the negative sense of direction. By letting $\Gamma$ shrink onto the cut and
noting that

$$\ln\frac{u+1}{u-1} \to \ln\frac{1+x}{1-x} - i\pi \quad \text{as } u \to x + i0, \quad -1 < x < 1,$$

whereas

$$\ln\frac{u+1}{u-1} \to \ln\frac{1+x}{1-x} + i\pi \quad \text{as } u \to x - i0, \quad -1 < x < 1,$$

we find in the limit

$$\lambda_{2\nu} = -\frac{1}{2\pi i} \left[ \int_{-1}^{1} \frac{x^{2\nu-2}\,dx}{\ln\dfrac{1+x}{1-x} - i\pi} + \int_{1}^{-1} \frac{x^{2\nu-2}\,dx}{\ln\dfrac{1+x}{1-x} + i\pi} \right]
= -\int_{-1}^{1} \frac{x^{2\nu-2}}{\pi^2 + \ln^2\dfrac{1+x}{1-x}}\,dx < 0, \qquad \nu \ge 1, \tag{6.159}$$

as was to be shown.
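We also note from (6.159) that

$$|\lambda_{2\nu}| < \frac{1}{\pi^2} \int_{-1}^{1} x^{2\nu-2}\,dx = \frac{2}{\pi^2}\,\frac{1}{2\nu-1}, \qquad \nu = 1, 2, 3, \ldots,$$

that is,

$$\lambda_{2\nu} = O\!\left(\frac1\nu\right) \quad \text{as } \nu \to \infty.$$

We now recall a theorem of Littlewood, which says that if

$$f(z) = \sum_{n=0}^{\infty} \lambda_n z^n$$

is convergent in $|z| < 1$ and satisfies

$$f(x) \to s \ \text{ as } x \uparrow 1, \qquad \lambda_n = O\!\left(\frac1n\right) \ \text{ as } n \to \infty,$$

then

$$\sum_{n=0}^{\infty} \lambda_n = s.$$

In our case,

$$f(x) = \frac{x}{\ln\dfrac{1+x}{1-x}} \to 0 \quad \text{as } x \uparrow 1,$$

and so

$$\sum_{\nu=0}^{\infty} \lambda_{2\nu} = 0, \quad \text{that is,} \quad \sum_{\nu=1}^{\infty} \lambda_{2\nu} = -\frac12. \tag{6.160}$$

For an application of (6.160), see Ex. 11(b).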
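The real integral in (6.159) lends itself to a quick numerical check. The sketch below is ours and makes one assumption beyond the text: it substitutes $x = \tanh(t/2)$, so that $\ln\frac{1+x}{1-x} = t$ and $dx = dt/(2\cosh^2(t/2))$, and then applies the composite trapezoidal rule on a truncated interval. The function name and the parameters `t_max` and `n` are hypothetical.

```python
import math

def lambda_from_integral(v, t_max=60.0, n=6000):
    """Approximate lambda_{2v} (v >= 1) from the real integral in (6.159).

    The substitution x = tanh(t/2) turns ln((1+x)/(1-x)) into t and
    dx into dt / (2 cosh^2(t/2)); the trapezoidal rule is then applied on
    [-t_max, t_max], where the integrand is smooth and decays like exp(-|t|).
    """
    h = 2.0 * t_max / n
    total = 0.0
    for i in range(n + 1):
        t = -t_max + i * h
        weight = 0.5 if i in (0, n) else 1.0
        x = math.tanh(t / 2.0)
        density = 1.0 / (2.0 * math.cosh(t / 2.0) ** 2 * (math.pi ** 2 + t * t))
        total += weight * x ** (2 * v - 2) * density
    return -h * total

print(lambda_from_integral(1), -1 / 6)
print(lambda_from_integral(2), -2 / 45)
```

The two printed pairs should agree to many digits with the series values $\lambda_2 = -\frac16$ and $\lambda_4 = -\frac{2}{45}$ obtained by expanding (6.157) directly.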


6.4.3 Applications 


Theorem 6.4.2 and its proof technique have a number of interesting consequences, 
which we now discuss. 


Theorem 6.4.3. For every stable $k$-step method of maximum order $p$ ($= k + 1$ or
$k + 2$), we have that

$$\ell_{p+1} < 0, \qquad C_{k,p} = \frac{\ell_{p+1}}{\sum_{s=0}^{k} \beta_s} < 0. \tag{6.161}$$

Proof. We recall from (6.131) and (6.132) that

$$\ell_{p+1} = c_p, \qquad C_{k,p} = \frac{c_p}{c_0},$$

where $c_0 = \beta(1)$ and

$$\delta(\zeta) = \frac{\alpha(\zeta)}{\ln \zeta} - \beta(\zeta) = c_p (\zeta - 1)^p + \cdots, \qquad c_p \ne 0.$$

With the transformation

$$\zeta = \frac{1+z}{1-z}, \qquad \zeta - 1 = \frac{2z}{1-z} = 2z + \cdots$$

used in the proof of Theorem 6.4.2, we get for the function $d(z)$ in (6.151)

$$\left(\frac{1-z}{2}\right)^{-k} d(z) = 2^p c_p z^p + \cdots,$$

or

$$d(z) = 2^{p-k} c_p z^p + \cdots.$$

On the other hand, by (6.153) and (6.154),

$$d(z) = b_p z^p + \cdots.$$

Therefore, $c_p = 2^{k-p} b_p$ and

$$\ell_{p+1} = 2^{k-p} b_p, \qquad C_{k,p} = 2^{k-p}\,\frac{b_p}{\beta(1)}. \tag{6.162}$$

From the proof of Theorem 6.4.2, we know that $b_p < 0$. This proves the first relation
in (6.161).
To prove the second, we must show that

$$\beta(1) > 0. \tag{6.163}$$

Since $p \ge 1$, we have

$$\beta(1) = \sum_{s=0}^{k} \beta_s = \sum_{s=0}^{k} s\,\alpha_s = \alpha'(1).$$

By the root condition, $\alpha'(1) \ne 0$. If we had $\alpha'(1) < 0$, then $\alpha(1) = 0$ in conjunction
with $\alpha(t) \sim t^k$ as $t \to \infty$ would imply that $\alpha(t)$ vanishes for some $t > 1$,
contradicting the root condition. Thus, $\alpha'(1) > 0$, proving (6.163). □


We note, incidentally, from (6.162), since $\beta(1) = 2^k b(0) = 2^k b_0$ (cf. (6.152)
and (6.154)), that

$$C_{k,p} = 2^{-p}\,\frac{b_p}{b_0}. \tag{6.164}$$
Theorem 6.4.4. (a) Let $k \ge 3$ be odd, and

$$c_k = \inf |C_{k,k+1}|, \tag{6.165}$$

where the infimum is taken over all stable $k$-step methods of (maximum) order
$p = k + 1$. If $k \ge 5$, there is no such method for which the infimum in (6.165)
is attained. If $k = 3$, all three-step methods of order $p = 4$ with
$\alpha(\zeta) = (\zeta - 1)(\zeta + 1)(\zeta - \lambda)$, $-1 < \lambda < 1$, have $|C_{3,4}| = c_3 = \frac{1}{180}$.

(b) Let $k \ge 2$ be even, and

$$c_k = \inf |C_{k,k+2}|, \tag{6.166}$$

where the infimum is taken over all stable $k$-step methods of (maximum) order
$p = k + 2$. If $k \ge 4$, there is no such method for which the infimum in (6.166)
is attained. If $k = 2$, Simpson’s rule is the only two-step method of order $p = 4$
with $|C_{2,4}| = c_2 = \frac{1}{180}$.

Proof. (a) By (6.164) we have

$$C_{k,k+1} = 2^{-(k+1)}\,\frac{b_{k+1}}{b_0}.$$

From the proof of Theorem 6.4.2 (cf. (6.155)), we have

$$b_0 = \lambda_0 a_1 = \tfrac12 a_1, \qquad b_{k+1} = \lambda_2 a_k + \lambda_4 a_{k-2} + \cdots + \lambda_{k+1} a_1 < 0.$$

So,

$$|C_{k,k+1}| = \frac{1}{2^k a_1} \left( |\lambda_2| a_k + |\lambda_4| a_{k-2} + \cdots + |\lambda_{k+1}| a_1 \right) \ge \frac{|\lambda_{k+1}|}{2^k}. \tag{6.167}$$

We claim that

$$c_k = \inf |C_{k,k+1}| = \frac{|\lambda_{k+1}|}{2^k}.$$

Indeed, take

$$a(z) = z\,\frac{(z - x_1)(z - x_2) \cdots (z - x_{k-1})}{(1 - x_1)(1 - x_2) \cdots (1 - x_{k-1})},$$

where the $x_i$ are distinct negative numbers. Then $a(1) = 1$ (as must be), and $a(z)$
satisfies the root condition. Now let $x_i \to -\infty$ (all $i$), then $a(z) \to z$, that is,

$$a_1 \to 1, \qquad a_s \to 0 \quad \text{for } s = 2, 3, \ldots, k.$$

Therefore, by (6.167),

$$|C_{k,k+1}| \to \frac{|\lambda_{k+1}|}{2^k}.$$

Now suppose that $|C_{k,k+1}| = c_k$ for some stable method. Then, necessarily,

$$a_k = a_{k-2} = \cdots = a_3 = 0,$$

that is,

$$a(z) = a_1 z + a_2 z^2 + a_4 z^4 + \cdots + a_{k-1} z^{k-1}. \tag{6.168}$$

In particular (cf. Sect. 6.4.2, (ii)), $\zeta = -1$ is a zero of $\alpha(\zeta)$ and $a_{k-1} \ne 0$ by the
stability requirement. We distinguish two cases:

Case 1. $k = 3$. Here,

$$a(z) = a_1 z + a_2 z^2 = z(a_1 + a_2 z).$$

By stability, $a(z)$ has a zero anywhere on the negative real axis and a zero at $z = 0$.
Transforming back to $\zeta$ in the usual way (cf. Sect. 6.4.2, (6.150)), this means that

$$\alpha(\zeta) = (\zeta - 1)(\zeta + 1)(\zeta - \lambda), \qquad -1 < \lambda < 1.$$

All these methods (of order $p = 4$) by construction have

$$|C_{3,4}| = c_3 = \frac{|\lambda_4|}{2^3} = \frac{1}{180}.$$

Case 2. $k \ge 5$. If $z_i$ are the zeros of $a(z)$ in (6.168), then by Vieta’s rule,

$$\sum_{i=1}^{k-1} z_i = -\frac{a_{k-2}}{a_{k-1}} = 0,$$

hence

$$\sum_{i=1}^{k-1} \operatorname{Re} z_i = 0.$$

Since by stability, $\operatorname{Re} z_i \le 0$ for all $i$, we must have $\operatorname{Re} z_i = 0$ for all $i$. This means
that

$$a(z) = \text{const} \cdot z \prod_{j} (z^2 + y_j^2)$$

is an odd polynomial, so $a_2 = a_4 = \cdots = a_{k-1} = 0$, contradicting stability
($\zeta = -1$ is a multiple zero).

(b) The proof is similar to the one in case (a), and we leave it as an exercise for the
reader (see Ex. 13). □
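The statement for $k = 3$ can be verified directly for individual values of $\lambda$. The sketch below is an illustration of ours: given $\alpha(\zeta) = (\zeta-1)(\zeta+1)(\zeta-\lambda)$, it determines $\beta_0, \ldots, \beta_3$ from the four order conditions $Lt^j = 0$, $j = 1, \ldots, 4$ (assuming the normalization $Lu = \sum_s \alpha_s u(s) - \sum_s \beta_s u'(s)$), and then evaluates $C_{3,4} = \ell_5/\sum_s \beta_s$ exactly. All helper names are hypothetical.

```python
from fractions import Fraction as F
from math import factorial

def solve(A, b):
    """Gauss-Jordan elimination over the rationals (A square and nonsingular)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col:
                M[r] = [x - M[r][col] * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]

def max_order_beta(alpha):
    """beta_0..beta_k making L t^j = 0 for j = 1..k+1 (order at least k+1)."""
    k = len(alpha) - 1
    A = [[j * F(s) ** (j - 1) for s in range(k + 1)] for j in range(1, k + 2)]
    b = [sum(a * F(s) ** j for s, a in enumerate(alpha)) for j in range(1, k + 2)]
    return solve(A, b)

def error_constant(alpha, beta, p):
    """C_{k,p} = l_{p+1} / sum(beta) with l_j = L t^j / j!."""
    j = p + 1
    Ltj = sum(a * F(s) ** j for s, a in enumerate(alpha)) \
        - j * sum(bs * F(s) ** (j - 1) for s, bs in enumerate(beta))
    return Ltj / factorial(j) / sum(beta)

def alpha_three_step(lam):
    """Coefficients (low-to-high) of (zeta - 1)(zeta + 1)(zeta - lam)."""
    return [lam, F(-1), -lam, F(1)]

for lam in (F(-1, 2), F(0), F(1, 3), F(9, 10)):
    alpha = alpha_three_step(lam)
    beta = max_order_beta(alpha)
    print(lam, error_constant(alpha, beta, 4))
```

For each sampled $\lambda$ the error constant comes out as $-\frac{1}{180}$; the choice $\lambda = 0$ reproduces Simpson’s rule shifted by one step.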


It is interesting to observe that if in the infimum (6.165) of Theorem 6.4.4 we
admit only methods whose characteristic polynomials $\alpha(\zeta)$ have the zero $\zeta_1 = 1$
and all other zeros bounded in absolute value by $\gamma < 1$ (and hence are stable), then
it can be shown that the infimum is attained precisely when

$$\alpha(\zeta) = (\zeta - 1)(\zeta + \gamma)^{k-1}, \tag{6.169}$$

that is, all zeros other than $\zeta = 1$ are placed at the point $\zeta = -\gamma$ farthest away from
$\zeta = 1$. Moreover, there are explicit expressions for the minimum error constant. For
example, if $k$ is odd, then (cf. Ex. 12)

$$\min |C_{k,k+1}| = 2^{-k} \left[ |\lambda_{k+1}| + \binom{k-1}{2} \omega^2 |\lambda_{k-1}| + \binom{k-1}{4} \omega^4 |\lambda_{k-3}| + \cdots + \omega^{k-1} |\lambda_2| \right], \tag{6.170}$$

where

$$\omega = \frac{1-\gamma}{1+\gamma} \tag{6.171}$$

and $\lambda_2, \lambda_4, \ldots$ are the expansion coefficients in (iv) of the proof of Theorem 6.4.2.

If $\gamma = 0$, we of course recover the Adams-Moulton formulae.


6.5 Stiff Problems


In Sect. 6.4 we were concerned with multistep methods applied to problems on
a finite interval; however, the presence of “stiffness” (i.e., of rapidly decaying 
solutions) requires consideration of infinite intervals and related stability concepts. 
These again, as in the case of one-step methods, are developed in connection with 
a simple model problem exhibiting exponentially decaying solutions. The relevant 
concepts of stability then describe to what extent multistep methods are able to 
simulate such solutions, especially those decaying at a rapid rate. It turns out that 
multistep methods are much more limited in their ability to effectively deal with 
such solutions than are one-step methods. This is particularly so if one requires 
A-stability, as defined in the next section, and to a lesser extent if one weakens the 
stability requirement in a manner described briefly in Sect. 6.5.2. 


6.5.1 A-Stability 


For simplicity we consider the scalar model problem (cf. Chap. 5, (5.166))

$$\frac{dy}{dx} = \lambda y, \qquad 0 \le x < \infty, \quad \operatorname{Re} \lambda < 0, \tag{6.172}$$

all of whose solutions decay exponentially at infinity. In particular,

$$y(x) \to 0 \quad \text{as } x \to \infty \tag{6.173}$$

for every solution of (6.172).
Definition 6.5.1. A multistep method (6.2) is called A-stable if, when applied to
(6.172), it produces a grid function $\{u_n\}_{n=0}^{\infty}$ satisfying

$$u_n \to 0 \quad \text{as } n \to \infty, \tag{6.174}$$

regardless of the choice of starting values (6.4). (It is assumed that the method is
applied with constant grid length $h > 0$.)
We may assume (cf. Sect. 6.4.1) that the multistep method is irreducible; that is,
the polynomials $\alpha(t)$ and $\beta(t)$ defined in (6.122) have no common zeros.
Application of (6.2) to (6.172) yields

$$\sum_{s=0}^{k} \alpha_s u_{n+s} - h\lambda \sum_{s=0}^{k} \beta_s u_{n+s} = 0, \tag{6.175}$$

a constant-coefficient difference equation of order $k$ whose characteristic polynomial
is (cf. Sect. 6.3.1)

$$\tilde\alpha(t) = \alpha(t) - \tilde h\,\beta(t), \qquad \tilde h = h\lambda \in \mathbb{C}. \tag{6.176}$$

All solutions of (6.175) will tend to 0 as $n \to \infty$ if the zeros of $\tilde\alpha$ are all strictly less
than 1 in absolute value. The multistep method, therefore, is A-stable if and only if

$$\{\tilde\alpha(\zeta) = 0,\ \operatorname{Re} \tilde h < 0\} \quad \text{implies} \quad |\zeta| < 1.$$

This is the same as saying that

$$\{\tilde\alpha(\zeta) = 0,\ |\zeta| \ge 1\} \quad \text{implies} \quad \operatorname{Re} \tilde h \ge 0.$$

But $\tilde\alpha(\zeta) = 0$ implies $\beta(\zeta) \ne 0$ (since otherwise $\alpha(\zeta) = \beta(\zeta) = 0$, contrary to the
assumed irreducibility of the method). Thus $\tilde\alpha(\zeta) = 0$ implies

$$\tilde h = \frac{\alpha(\zeta)}{\beta(\zeta)},$$

and A-stability is characterized by the condition

$$\operatorname{Re} \frac{\alpha(\zeta)}{\beta(\zeta)} \ge 0 \quad \text{if } |\zeta| \ge 1. \tag{6.177}$$
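Condition (6.177) can be probed numerically on the unit circle. The following sketch is ours; it samples only the boundary $|\zeta| = 1$, so it can refute A-stability but does not by itself establish it (for that one still needs the maximum-principle argument for $|\zeta| > 1$). The coefficient lists are the standard ones for the trapezoidal rule and for the backward differentiation methods (6.147) with $k = 2, 3$; the function name is hypothetical.

```python
import cmath
import math

def min_real_part_on_unit_circle(alpha, beta, n=720):
    """Smallest sampled value of Re[alpha(zeta)/beta(zeta)] on |zeta| = 1.

    alpha, beta are coefficient lists (low-to-high). Sample angles are offset
    by half a step so that zeta = 1 and zeta = -1 are never hit exactly.
    """
    def poly(c, z):
        return sum(cj * z ** j for j, cj in enumerate(c))
    worst = math.inf
    for i in range(n):
        zeta = cmath.exp(2j * math.pi * (i + 0.5) / n)
        worst = min(worst, (poly(alpha, zeta) / poly(beta, zeta)).real)
    return worst

trapezoidal = ([-1, 1], [0.5, 0.5])
bdf2 = ([0.5, -2, 1.5], [0, 0, 1])
bdf3 = ([-1 / 3, 1.5, -3, 11 / 6], [0, 0, 0, 1])

for name, (a, b) in (("trapezoidal", trapezoidal), ("BDF2", bdf2), ("BDF3", bdf3)):
    print(name, min_real_part_on_unit_circle(a, b))
```

Up to rounding, the minimum is zero for the trapezoidal rule, nonnegative for $k = 2$, and clearly negative (about $-\frac{1}{12}$) for $k = 3$, so the third-order backward differentiation method violates (6.177).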


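Theorem 6.5.1. If the multistep method (6.2) is A-stable, then it has order $p \le 2$
and error constant $C_{k,p} \le -\frac{1}{12}$. The trapezoidal rule is the only A-stable method
for which $p = 2$ and $C_{k,p} = -\frac{1}{12}$.

Proof. From (6.135) and (6.134) we have for any $k$-step method of order $p$

$$\frac{\alpha(\zeta)}{\ln \zeta} - \beta(\zeta) = c_p (\zeta - 1)^p + \cdots,$$

which, after division by $\alpha(\zeta) = \alpha'(1)(\zeta - 1) + \cdots = \beta(1)(\zeta - 1) + \cdots = c_0 (\zeta - 1) + \cdots$, gives

$$\frac{1}{\ln \zeta} - \frac{\beta(\zeta)}{\alpha(\zeta)} = C_{k,p} (\zeta - 1)^{p-1} + \cdots. \tag{6.178}$$

For the trapezoidal rule, having $\alpha_T(\zeta) = \zeta - 1$ and $\beta_T(\zeta) = \frac12(\zeta + 1)$, one
easily finds

$$\frac{1}{\ln \zeta} - \frac{\beta_T(\zeta)}{\alpha_T(\zeta)} = -\frac{1}{12} (\zeta - 1) + \cdots. \tag{6.179}$$

Letting

$$\Delta(\zeta) = \frac{\beta(\zeta)}{\alpha(\zeta)} - \frac{\beta_T(\zeta)}{\alpha_T(\zeta)}, \tag{6.180}$$

one obtains from (6.178) and (6.179) by subtraction

$$\Delta(\zeta) = -\left( c + \frac{1}{12} \right)(\zeta - 1) + \cdots, \tag{6.181}$$

where

$$c = \begin{cases} C_{k,p} & \text{if } p = 2,\\ 0 & \text{if } p > 2. \end{cases} \tag{6.182}$$

Given that our method is A-stable, we have from (6.177) that $\operatorname{Re}[\alpha(\zeta)/\beta(\zeta)] \ge 0$ if
$|\zeta| \ge 1$, or equivalently,

$$\operatorname{Re} \frac{\beta(\zeta)}{\alpha(\zeta)} \ge 0 \quad \text{if } |\zeta| \ge 1.$$

On the other hand,

$$\operatorname{Re} \frac{\beta_T(\zeta)}{\alpha_T(\zeta)} = 0 \quad \text{if } |\zeta| = 1.$$

It follows from (6.180) that $\operatorname{Re} \Delta(\zeta) \ge 0$ on $|\zeta| = 1$, and by the maximum principle
applied to the real part of $\Delta(\zeta)$, since $\Delta(\zeta)$ is analytic in $|\zeta| > 1$ (there are no zeros
of $\alpha(\zeta)$ outside the unit circle), that

$$\operatorname{Re} \Delta(\zeta) \ge 0 \quad \text{for } |\zeta| > 1. \tag{6.183}$$

Now putting $\zeta = 1 + \varepsilon$, $\operatorname{Re} \varepsilon > 0$, we have that $|\zeta| > 1$, and therefore, by (6.181)
and (6.183), for $|\varepsilon|$ sufficiently small, that

$$c + \frac{1}{12} \le 0.$$

If $p > 2$, this is clearly impossible in view of (6.182), whereas for $p = 2$ we
must have $C_{k,2} \le -\frac{1}{12}$. This proves the first part of Theorem 6.5.1. To prove the
second part, we note that $p = 2$ and $C_{k,p} = -\frac{1}{12}$ imply $\Delta(\zeta) = O((\zeta - 1)^2)$, and
taking $\zeta = 1 + \varepsilon$ as previously, the real part of $(\zeta - 1)^2$, and in fact of any power
$(\zeta - 1)^q$, $q \ge 2$, can take on either sign if $\operatorname{Re} \varepsilon > 0$. Consequently, by (6.183),
$\Delta(\zeta) \equiv 0$ and, therefore, the method being irreducible, it follows that $\alpha(\zeta) = \alpha_T(\zeta)$
and $\beta(\zeta) = \beta_T(\zeta)$. □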
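To see the bound of Theorem 6.5.1 at work, one can compare the error constant of the trapezoidal rule with that of another A-stable second-order method, the two-step backward differentiation method. The sketch below is ours; it assumes $C_{k,p} = Lt^{p+1}/((p+1)!\,\sum_s \beta_s)$ with $Lu = \sum_s \alpha_s u(s) - \sum_s \beta_s u'(s)$, and uses the two-step method normalized so that $\alpha_2 = 1$.

```python
from fractions import Fraction as F
from math import factorial

def error_constant(alpha, beta, p):
    """C_{k,p} = L t^{p+1} / ((p+1)! * sum(beta)) for an order-p method."""
    j = p + 1
    Ltj = sum(a * F(s) ** j for s, a in enumerate(alpha)) \
        - j * sum(b * F(s) ** (j - 1) for s, b in enumerate(beta))
    return Ltj / factorial(j) / sum(beta)

# Trapezoidal rule: u_{n+1} - u_n = (h/2)(f_{n+1} + f_n)
C_trap = error_constant([-1, 1], [F(1, 2), F(1, 2)], p=2)
# Two-step BDF: u_{n+2} - (4/3) u_{n+1} + (1/3) u_n = (2/3) h f_{n+2}
C_bdf2 = error_constant([F(1, 3), F(-4, 3), 1], [0, 0, F(2, 3)], p=2)

print(C_trap, C_bdf2)   # -1/12 -1/3
```

Both constants satisfy $C_{k,2} \le -\frac{1}{12}$, with equality for the trapezoidal rule only.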


6.5.2 A(α)-Stability


According to Theorem 6.5.1, asking for A-stability puts multistep methods into a
straitjacket. One can loosen it by weakening the demands on the region of absolute
stability, that is, the region

$$D_A = \{\tilde h \in \mathbb{C} :\ \tilde\alpha(\zeta) = 0 \ \text{implies} \ |\zeta| < 1\}. \tag{6.184}$$

A-stability requires the left half-plane $\operatorname{Re} \tilde h < 0$ to be contained in $D_A$. In many
applications, however, it is sufficient that only part of the left half-plane be contained
in $D_A$, for example, the wedge-like region

$$W_\alpha = \{\tilde h \in \mathbb{C} :\ |\arg(-\tilde h)| < \alpha,\ \tilde h \ne 0\}, \qquad 0 < \alpha < \tfrac12\pi. \tag{6.185}$$

This gives rise to the following definition.

Definition 6.5.2. The multistep method (6.2) is said to be A(α)-stable, $0 < \alpha < \frac12\pi$,
if $W_\alpha \subset D_A$.

There are now multistep methods that have order $p > 2$ and are A(α)-stable
for suitable $\alpha$, the best known being the $k$th-order backward differentiation method
(6.147). This one is known (cf. Ex. 10) to be stable in the sense of Sect. 6.3.2 for
$1 \le k \le 6$ (but not for $k > 6$) and for these values of $k$ turns out to be also
A(α)-stable with angles $\alpha$ as shown in Table 6.1. They can therefore be effectively
employed for problems whose stiffness is such that the responsible eigenvalues are
relatively close to the negative real axis.

Table 6.1 A(α)-stable kth-order backward differentiation methods
k    1      2      3        4        5        6
α    90°    90°    86.03°   73.35°   51.84°   17.84°

In principle, for any given $\alpha < \frac12\pi$, there exist A(α)-stable $k$-step methods of
order $k$ for every $k = 1, 2, 3, \ldots$, but their error constants may be so large as to
make them practically useless.
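The angles in Table 6.1 can be approximated with a few lines of code. The sketch below is ours and rests on two assumptions that are not spelled out in the text: first, that for the method (6.147), with $\beta(\zeta) = \zeta^k$, the ratio $\alpha(\zeta)/\beta(\zeta)$ equals $\sum_{j=1}^{k} (1 - 1/\zeta)^j / j$; second, that the angle $\alpha$ is the smallest value of $|\arg(-\tilde h)|$ along the curve $\tilde h = \alpha(\zeta)/\beta(\zeta)$, $|\zeta| = 1$ (the "boundary locus"), capped at 90°. Function names and the sampling parameter `n` are hypothetical.

```python
import cmath
import math

def bdf_boundary_locus(k, theta):
    """h = alpha(zeta)/beta(zeta) for the k-step BDF method at zeta = exp(i*theta).

    By (6.147), with beta(zeta) = zeta^k, this ratio equals
    sum_{j=1}^{k} (1 - 1/zeta)^j / j.
    """
    w = 1.0 - cmath.exp(-1j * theta)
    return sum(w ** j / j for j in range(1, k + 1))

def bdf_stability_angle(k, n=20000):
    """Largest alpha (degrees, capped at 90) with no sampled locus point in W_alpha."""
    best = 90.0
    for i in range(1, n):               # theta in (0, pi); the locus is symmetric
        h = bdf_boundary_locus(k, math.pi * i / n)
        best = min(best, math.degrees(abs(cmath.phase(-h))))
    return best

for k in range(1, 7):
    print(k, round(bdf_stability_angle(k), 2))
```

The printed angles can be compared with the entries of Table 6.1; small discrepancies in the last digit would reflect the sampling of the curve rather than the method itself.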


6.6 Notes to Chapter 6 


Most texts mentioned in the Notes to Chap. 5 also contain discussions of multistep 
methods. A book specifically devoted to the convergence and stability theory of 
multistep methods is Henrici [1977]. It also applies the statistical theory of roundoff 
errors to such methods, something that is rarely found elsewhere in this context. A 
detailed treatment of variable-step/variable-order Adams-type methods is given in 
the book by Shampine and Gordon [1975]. 


Section 6.1.2. The choice of exponential sums as gauge functions in $\Omega_p$ is
made in Brock and Murray [1952]; trigonometric polynomials are considered in
Quade [1951] and Gautschi [1961], and products, respectively sums, of ordinary
and trigonometric polynomials in Stiefel and Bettis [1969] and Bettis [1969/1970]
(also see Stiefel and Scheifele [1971, Chap. 7, Sect. 24]).


Section 6.2.1. Adams described his method in a chapter of the book Bashforth and 
Adams [1883] on capillary action. He not only derived both the explicit and implicit 
formulae (6.52) and (6.58) but also proposed a scheme of prediction and correction. 
He did not predict $u_{n+k}$ by the first formula in (6.64) but rather by the backtracking
scheme described at the end of Sect. 6.2.2. He then solved the implicit equation by 
what amounts to Newton’s method, the preferred method nowadays of dealing with 
(mildly) stiff problems. 


Section 6.2.2. Moulton describes his predictor-corrector method in Moulton
[1926]. He predicts exactly as Adams did, and then iterates on the corrector formula
as described in the text, preferably only once, by taking the step sufficiently small.
No reference is made to Adams, but then, according to the preface, the history of 
the subject is not one of the concerns of the book. 


Section 6.2.3. Milne’s suggestion of estimating the truncation error in terms of the 
difference of the corrected and predicted value is made in Milne [1926]. 

The first asymptotic formula in (6.72) is due to Spital and Trench, the second to 
Steffensen; for literature and an elementary derivation, see Gautschi [1976b]. 


Section 6.3.1. A basic and elegant text on linear difference equations is 
Miller [1968]. Scalar difference equations such as those encountered in this 
paragraph are viewed there as special cases of first-order systems of difference 
equations. A more extensive account of difference equations is Agarwal [2000]. 
Readers who wish to learn about practical aspects of difference equations, 
including applications to numerical analysis and the sciences, are referred to 
Wimp [1984], Goldberg [1986], Lakshmikantham and Trigiante [2002], and Kelley 
and Peterson [2001]. 


Section 6.3.2. As in the case of one-step methods, the stability concept of
Definition 6.3.1, taken from Keller [1992, Sect. 1.3], is also referred to as zero
stability (i.e., for $h \to 0$), to distinguish it from other stability concepts used
in connection with stiff problems (cf. Sect. 6.5). The phenomenon of (strong)
instability (i.e., lack of zero stability) was first noted by Todd [1950]. Simple roots
$t_s \ne 1$ of the characteristic equation (6.78) with $|t_s| = 1$, although not violating
the root condition, may give rise to “weak” stability (cf. Henrici [1962, p. 242]).
This was first observed by Rutishauser [1952] in connection with Milne’s method
(the implicit fourth-order method with $\alpha(t) = t^2 - 1$, i.e., Simpson’s rule), although
Dahlquist was aware of it independently; see the interesting historical account of
Dahlquist [1985] on the evolution of concepts of numerical (in)stability.

Theorem 6.3.3 is due to Dahlquist [1956]. The proof given here follows, with 
minor deviations, the presentation in Hull and Luxemburg [1960]. 


Section 6.3.3. There is a stability and convergence theory also for linear multistep
methods on nonuniform grids. The coefficients $\alpha_s = \alpha_{s,n}$, $\beta_s = \beta_{s,n}$ in the $k$-step
method (6.2) then depend on the grid size ratios $\omega_i = h_i/h_{i-1}$, $i = n+k-1$,
$n+k-2, \ldots, n+1$. If the coefficients $\alpha_{s,n}$, $\beta_{s,n}$ are uniformly bounded, the basic
stability and convergence results of Sects. 6.3.2 and 6.3.3 carry over to grids that are
quasiuniform in the sense of having ratios $h_n/h_{n-1}$ uniformly bounded and bounded
away from zero; see, for example, Hairer et al. [1993, Chap. 3, Sect. 5].


Section 6.3.4. Theorem 6.3.5 is due to Salihov [1962] and, independently, with
a refined form of the $O(h^{p+1})$ term in (6.107), to Henrici [1962, Theorem 5.12],
[1977, Theorem 4.2].

Result (6.107) may be interpreted as providing the first term in an asymptotic
expansion of the error in powers of $h$. The existence of a full expansion has been
investigated by Gragg [1964]. In Gragg [1965], a modified midpoint rule is defined
which has an expansion in even powers of $h$ and serves as a basis of Gragg’s
extrapolation method.


Section 6.3.5. The global validity of the Milne estimator, under the assumptions
stated, is proved by Henrici [1962, Theorem 5.13]. The idea of Theorem 6.3.6(5) is 
also suggested there, but not carried out. 

Practical codes based on Adams predictor—corrector type methods use variable 
steps and variable order in an attempt to obtain the solution to a prescribed accuracy 
with a minimum amount of computational work. In addition to sound strategies 
of when and how to change the step and the order of the method, this requires 
either interpolation to provide the necessary values of f relative to the new step, 
or extensions of Adams multistep formulae to nonuniform grids. The former was 
already suggested by both Adams and Moulton. It is the latter approach that is 
usually adopted nowadays. A detailed discussion of the many technical and practical 
issues involved in it can be found in Chaps. 5-7, and well-documented Fortran
programs in Chap. 10, of Shampine and Gordon [1975]. Such codes are “self-starting”
in the sense that they start off with Euler’s method and a very small step,
and from then on let the control mechanisms take over to arrive at a proper order and 
proper step size. A more recent version of Shampine and Gordon’s code is described 
in Hairer et al. [1993, Chap. 3, Sect. 7]. So are two other codes, available on Netlib, 
both based on Nordsieck’s formulation (cf. Notes to Chap. 5, Sect. 5.4) of the Adams 
method, one originally due to Brown et al. [1989] and the other to Gear [1971c].


Section 6.4.1. Pairs of multistep methods, such as those considered in the second 
Example of this section, having the same global error constants except for sign, 
are called polar pairs in Rakitskii [1961, Sect. 2]. They are studied further in 
Salihov [1962]. The backward differentiation formula (6.147), with $k = 1$ or $k = 2$,
was proposed by Curtiss and Hirschfelder [1952] to integrate stiff equations. The 
method with values of k up to 6, for which it is stiffly stable (Gear [1969]), is 
implemented as part of the code DIFSUB in Gear [1971b,c] and subsequently in the 
“Livermore solver” LSODE in Hindmarsh [1980]. For a variable-step version of the 
backward differentiation formulae, see Hairer et al. [1993, p. 400]. 


Section 6.4.2. Theorem 6.4.2 is a celebrated result of Dahlquist proved in 
Dahlquist [1956] and extended to higher-order systems of differential equations 
in his thesis, Dahlquist [1959]. The k-step method of maximum order p = 2k is 
derived in Dahlquist [1956, Sect. 2.4]. The proof of (iv) based on Cauchy’s formula 
is from Dahlquist [1956, p. 51]. A more algebraic proof is given in Henrici [1962, 
p. 233]. For the theorem of Littlewood, see Titchmarsh [1939, Sect. 7.66]. 


Section 6.4.3. The result of Theorem 6.4.4(a) was announced in Gautschi [1963]; 
case (b) follows from Problem 37 in Henrici [1962, p. 286]. For the remark at the 
end of this paragraph, see Gautschi and Montrone [1980]. 


Section 6.5. The standard text on multistep methods for stiff problems is again 
Hairer and Wanner [2010, Chap. 5]. It contains, in Chap. 5, Sect. 5, references to, 
and numerical experiments with, a number of multistep codes. 


Section 6.5.1. Theorem 6.5.1 is due to Dahlquist [1963]. The proof follows Hairer 
and Wanner [2010, Chap. 5, Sect. 1]. There are basically two ways to get around 
the severe limitations imposed by Theorem 6.5.1. One is to weaken the stability
requirement, the other to strengthen the method. An example of the former is
A(α)-stability defined in Sect. 6.5.2; various possibilities for the latter are discussed in
Hairer and Wanner [2010, Chap. 5, Sect. 3]. Order star theory [ibid., Sect. 4], this 
time on Riemann surfaces, is again an indispensable tool for studying the attainable 
order, subject to A-stability. 


Section 6.5.2. The concept of A(α)-stability was introduced by Widlund [1967].
For Table 6.1 as well as for the remark in the final paragraph, see Hairer and 
Wanner [2010, Chap. 5, Sect. 2]. 

The regions $D_A$ of absolute stability for explicit and implicit $k$-step Adams
methods, as well as for the respective predictor-corrector methods, are depicted for
$k = 1, 2, \ldots, 6$ in Hairer and Wanner [2010, Chap. 5, Sect. 1]. As one would expect,
they become rapidly small, more so for the explicit method than for the others. More 
favorable are the stability domains for the backward differentiation methods, which 
are also shown in the cited reference and nicely illustrate the results of Table 6.1. 

As in the case of one-step methods, there is a theory of nonlinear stability
and convergence also for multistep methods and their “one-legged” companions

$$\sum_{s=0}^{k} \alpha_s u_{n+s} = h f\!\left( \sum_{s=0}^{k} \beta_s x_{n+s},\ \sum_{s=0}^{k} \beta_s u_{n+s} \right).$$

The theory is extremely rich
in technical results, involving yet another concept of stability, G-stability, for which
we refer again to Hairer and Wanner [2010, Chap. 5, Sects. 6-9].


Exercises and Machine Assignments to Chapter 6 


Exercises 


1. Describe how Newton’s method is applied to solve the system of nonlinear
equations

$$u_{n+k} = h\beta_k f(x_{n+k}, u_{n+k}) + g_n, \qquad g_n = h \sum_{s=0}^{k-1} \beta_s f_{n+s} - \sum_{s=0}^{k-1} \alpha_s u_{n+s},$$

for the next approximation, $u_{n+k}$.
2. The system of nonlinear equations

$$u_{n+k} = h\beta_k f(x_{n+k}, u_{n+k}) + g_n, \qquad \beta_k \ne 0,$$

arising in each step of an implicit multistep method (cf. (6.5)) may be solved
by

• Newton’s method;

• the modified Newton method (with the Jacobian held fixed at its value at the
initial approximation);

• the method of successive approximations (fixed point iteration).

Assume that $f$ has continuous second partial derivatives with respect to the
$u$-variables and the initial approximation $u^{[0]}_{n+k}$ satisfies $u^{[0]}_{n+k} = u_{n+k} + O(h^g)$
for some $g > 0$.

(a) Show that the $\nu$th iterate $u^{[\nu]}_{n+k}$ in Newton’s method has the property that
$u^{[\nu]}_{n+k} - u_{n+k} = O(h^{r_\nu})$, where $r_\nu = 2^\nu (g + 1) - 1$. Derive analogous
statements for the other two methods.

(b) Show that $g = 1$, if one takes $u^{[0]}_{n+k} = u_{n+k-1}$.

(c) Suppose one adopts the following stopping criterion: quit the iteration after
$\mu$ iterations, where $\mu$ is the smallest integer $\nu$ such that $r_\nu > p + 1$, where $p$
is the order of the method. For each of the three iterations, determine $\mu$ for
$p = 2, 3, \ldots, 10$. (Assume $g = 1$ for simplicity.)

(d) If $g = p$, what would $\mu$ be in (c)?

3. (a) Consider an explicit multistep method of the form

$$u_{n+2} - u_{n-2} + \alpha(u_{n+1} - u_{n-1}) = h[\beta(f_{n+1} + f_{n-1}) + \gamma f_n].$$

Show that the parameters α, β, γ can be chosen uniquely so that the
method has order p = 6. {Hint: to preserve symmetry, and thus algebraic
simplicity, define the associated linear functional on the interval [−2, 2]
rather than [0, 4] as in Sect. 6.1.2. Why is this permissible?}
(b) Discuss the stability properties of the method obtained in (a).
4. For the local error constants $\gamma_k$, $\gamma_k^*$ of, respectively, the Adams–Bashforth and
Adams–Moulton method, prove that

$$|\gamma_k^*| < \frac{1}{k-1}\,\gamma_k \quad \text{for } k \ge 2.$$
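5. For the local error constants $\gamma_k$, $\gamma_k^*$ of, respectively, the Adams–Bashforth and
Adams–Moulton method, show that, as $k \to \infty$,

$$\gamma_k = \frac{1}{\ln k}\left[1 + O\!\left(\frac{1}{\ln k}\right)\right], \qquad \gamma_k^* = -\frac{1}{k \ln^2 k}\left[1 + O\!\left(\frac{1}{\ln k}\right)\right].$$

{Hint: express the constants in terms of the gamma function, use

$$\frac{\Gamma(k+t)}{\Gamma(k+1)} = k^{t-1}\left[1 + O\!\left(\frac{1}{k}\right)\right], \qquad k \to \infty,$$

and integrate by parts.}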

6. Consider the predictor–corrector method using the Adams–Bashforth formula
as predictor and the Adams–Moulton formula (once) as corrector, both in
difference form:

$$\mathring{u}_{n+k} = u_{n+k-1} + h\sum_{s=0}^{k-1} \gamma_s \nabla^s f_{n+k-1},$$
$$u_{n+k} = u_{n+k-1} + h\sum_{s=0}^{k-1} \gamma_s^* \nabla^s \mathring{f}_{n+k},$$
$$f_{n+k} = f(x_{n+k}, u_{n+k}),$$

where $\mathring{f}_{n+k} = f(x_{n+k}, \mathring{u}_{n+k})$, $\nabla\mathring{f}_{n+k} = \mathring{f}_{n+k} - f_{n+k-1}$, and so on.

(a) Show that

$$u_{n+k} = \mathring{u}_{n+k} + h\gamma_{k-1}\nabla^k \mathring{f}_{n+k}.$$

{Hint: first show that $\gamma_s^* = \gamma_s - \gamma_{s-1}$ for s = 0, 1, 2, ..., where $\gamma_{-1}$ is
defined to be zero.}
(b) Show that

$$\nabla^k \mathring{f}_{n+k} = \mathring{f}_{n+k} - \sum_{s=0}^{k-1} \nabla^s f_{n+k-1}.$$

{Hint: use the binomial identity $\sum_{\sigma=0}^{m} \binom{\sigma+j}{j} = \binom{m+j+1}{j+1}$.}
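7. Prove that the predictor–corrector method

$$\mathring{u}_{n+k} = -\sum_{s=0}^{k-1} \alpha_s u_{n+s} + h\sum_{s=0}^{k-1} \beta_s f_{n+s},$$
$$u_{n+k} = -\sum_{s=1}^{k-1} \alpha_s^* u_{n+s} + h\Big\{\beta_k^* f(x_{n+k}, \mathring{u}_{n+k}) + \sum_{s=1}^{k-1} \beta_s^* f_{n+s}\Big\},$$
$$f_{n+k} = f(x_{n+k}, u_{n+k})$$

is stable for every $f \in \mathcal{F}$ (cf. (6.87), (6.88)), if and only if its characteristic
polynomial $\alpha^*(t) = \sum_{s=0}^{k} \alpha_s^* t^s$, $\alpha_0^* = 0$, satisfies the root condition.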

8. Let $\alpha(\zeta) = \omega(\zeta)\alpha_0(\zeta)$, $\beta(\zeta) = \omega(\zeta)\beta_0(\zeta)$, and suppose $\{u_n\}$ is a solution of
the difference equation (6.2) corresponding to $\{\alpha_0, \beta_0\}$. Show that $\{u_n\}$ also
satisfies the difference equation (6.2) corresponding to $\{\alpha, \beta\}$.

9. Construct a pair of four-step methods, one explicit, the other implicit, both
having $\alpha(\zeta) = \zeta^4 - \zeta^3$ and order p = 4, but global error constants that are
equal in modulus and opposite in sign.

10. (a) Compute the zeros of the characteristic polynomial α(t) of the k-step
backward differentiation method (6.147) for k = 1(1)7 and the modulus of 
the absolutely largest zero other than 1. Hence, confirm the statement made 
at the end of Sect. 6.4.1. 

(b) Compare the error constant of the k-step backward differentiation method 
with that of the k-step Adams—Moulton method for k = 1(1)7. 
11. (a) Show that the polynomial b(z) in (6.152), for an explicit k-step method,
must satisfy b(1) = 0.

(b) Use the proof techniques of Theorem 6.4.2 to show that every stable explicit
k-step method has order p ≤ k. {Hint: make use of (6.160).}

12. Determine $\min |C_{k,k+1}|$, where the minimum is taken over all k-step methods
of order k + 1 whose characteristic polynomials have all their zeros $\zeta_i$ (except
$\zeta_1 = 1$) in the disk $\Gamma_\gamma = \{z \in \mathbb{C} : |z| \le \gamma\}$, where γ is a prescribed number
with $0 \le \gamma < 1$. {Hint: use the theory developed in Sects. 6.4.2 and 6.4.3.}

13. Prove Theorem 6.4.4(b). 


Machine Assignments 


1. This assignment pertains to an initial value problem for a scalar first-order 
differential equation. 


(a) A kth-order Adams–Bashforth predictor step amounts to adding h times
a linear combination $\sum_{s=0}^{k-1} \gamma_s \nabla^s f_{n+k-1}$ of k backward differences to the
last approximation ulast = $u_{n+k-1}$. Write a Matlab routine AB.m implementing
this step for k = 1:10. Use Maple to generate the required
coefficients $\gamma_s$. Take as input variables the number ulast, the k-vector
F = $[f_n, f_{n+1}, \ldots, f_{n+k-1}]$ of k successive function values, k, and h.

(b) Do the same as in (a) for the kth-order Adams–Moulton corrector step (in
Newton’s form), writing a routine AM.m whose input is ulast = $u_{n+k-1}$, the
vector F0 = $[f_{n+1}, f_{n+2}, \ldots, f_{n+k-1}, \mathring{f}_{n+k}]$, k, and h.

(c) Use the routines in (a) and (b) to write a routine PECE.m implementing
the PECE predictor/corrector scheme (6.64) based on the pair of Adams
predictor and corrector formulae:

$$\begin{aligned}
\text{P:}\quad & \mathring{u}_{n+k} = u_{n+k-1} + h\sum_{s=0}^{k-1} \gamma_s \nabla^s f_{n+k-1},\\
\text{E:}\quad & \mathring{f}_{n+k} = f(x_{n+k}, \mathring{u}_{n+k}),\\
\text{C:}\quad & u_{n+k} = u_{n+k-1} + h\sum_{s=0}^{k-1} \gamma_s^* \nabla^s \mathring{f}_{n+k},\\
\text{E:}\quad & f_{n+k} = f(x_{n+k}, u_{n+k}),
\end{aligned}$$

where $\nabla\mathring{f}_{n+k} = \mathring{f}_{n+k} - f_{n+k-1}$, $\nabla^2\mathring{f}_{n+k} = \nabla(\nabla\mathring{f}_{n+k}) = \mathring{f}_{n+k} - 2f_{n+k-1} + f_{n+k-2}$, etc.
As input parameters include the function f, the
initial and final values of x, the k initial approximations, the order k, the
number N of (equally spaced) grid intervals, and the values of n + k at
which printout is to occur.
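The P, E, C, E steps above can be sketched as follows. This is only an illustrative Python sketch, not the Matlab routines AB.m, AM.m, PECE.m that the assignment asks for; the function names `adams_gammas`, `backward_differences`, and `pece_step` are hypothetical. It assumes the standard recurrence $\sum_{m=0}^{s} \gamma_m/(s+1-m) = 1$ for the Adams–Bashforth coefficients (used here in place of Maple) together with the relation $\gamma_s^* = \gamma_s - \gamma_{s-1}$ from the hint to Ex. 6(a).

```python
from fractions import Fraction


def adams_gammas(k):
    """Return ([gamma_0..gamma_{k-1}], [gamma*_0..gamma*_{k-1}]) as exact fractions."""
    gam = []
    for s in range(k):
        # assumed recurrence: sum_{m=0}^{s} gamma_m / (s + 1 - m) = 1
        gam.append(Fraction(1) - sum(gam[m] / (s + 1 - m) for m in range(s)))
    # gamma*_s = gamma_s - gamma_{s-1}, with gamma_{-1} = 0 (hint to Ex. 6(a))
    gam_star = [gam[s] - (gam[s - 1] if s > 0 else 0) for s in range(k)]
    return gam, gam_star


def backward_differences(values):
    """Return [nabla^0 f, nabla^1 f, ..., nabla^{k-1} f] taken at the last entry."""
    diffs, row = [], list(values)
    while row:
        diffs.append(row[-1])
        row = [row[i + 1] - row[i] for i in range(len(row) - 1)]
    return diffs


def pece_step(f, x_next, u_last, F, h):
    """One Adams PECE step of order k = len(F); F = [f_n, ..., f_{n+k-1}]."""
    gam, gam_star = adams_gammas(len(F))
    d = backward_differences(F)
    u_pred = u_last + h * sum(float(g) * ds for g, ds in zip(gam, d))        # P
    f_pred = f(x_next, u_pred)                                               # E
    d0 = backward_differences(list(F[1:]) + [f_pred])
    u_new = u_last + h * sum(float(g) * ds for g, ds in zip(gam_star, d0))   # C
    return u_new, f(x_next, u_new)                                           # E
```

For a scalar problem, a driver loop would call `pece_step` repeatedly, each time dropping the oldest entry of F and appending the newly returned function value.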


2. (a) Consider the initial value problem

$$\frac{dy}{dx} = \frac{1}{1 - \varepsilon\cos y}, \qquad y(0) = 0, \quad 0 \le x \le 2\pi, \quad 0 < \varepsilon < 1.$$

Show that the exact solution y = y(x) is the solution of Kepler’s equation
$y - \varepsilon\sin y - x = 0$ (cf. Chap. 4, Ex. 21 and Chap. 5, MA 5(a) and (c)). What
is y(π)? What is y(2π)?

(b) Use the routine AB.m of MA 1(a) to solve the initial value problem by the
Adams–Bashforth methods of orders k = 2, 4, 6, with N = 40, 160, 640
integration steps of length h = 2π/N. At $x = \tfrac12\pi, \pi, \tfrac32\pi, 2\pi$, i.e., for
$n + k = \tfrac14 N, \tfrac12 N, \tfrac34 N, N$, print the approximations $u_{n+k}$ obtained, along with
the errors err = $u_{n+k} - y(x)$ and the scaled errors err/$h^k$. Compute y(x) by
applying Newton’s method to solve Kepler’s equation.

(c) Do the same as in (b) with the PECE.m routine of MA 1(c).

In both programs start with “exact” initial values. According to (6.107) and
the remarks at the end of Sect. 6.3.4, the scaled errors in the printout should
be approximately equal to $C_{k,k}e(x)$ resp. $C^*_{k,k}e(x)$, where $C_{k,k}$, $C^*_{k,k}$ are the
global error constants of the kth-order Adams–Bashforth resp. the kth-order
Adams predictor/corrector scheme, and $x = \tfrac12\pi, \pi, \tfrac32\pi, 2\pi$. Thus, the errors
of the predictor/corrector scheme should be approximately equal to $(C^*_{k,k}/C_{k,k})$
times the errors in the Adams–Bashforth method. Examine to what extent this is
confirmed by your numerical results.
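The reference values y(x) called for in (b) can be obtained along the following lines. This is a minimal Python sketch of Newton’s method for Kepler’s equation, not the book’s Matlab code; the function name `kepler`, the starting guess y = x, and the stopping tolerance are assumptions made here for illustration.

```python
import math


def kepler(x, eps, tol=1e-14, max_iter=50):
    """Solve y - eps*sin(y) - x = 0 for y by Newton's method (0 < eps < 1)."""
    y = x  # assumed starting guess; the text does not prescribe one
    for _ in range(max_iter):
        # Newton correction: residual divided by derivative 1 - eps*cos(y) > 0
        dy = (y - eps * math.sin(y) - x) / (1.0 - eps * math.cos(y))
        y -= dy
        if abs(dy) <= tol * max(1.0, abs(y)):
            break
    return y
```

Since the derivative $1 - \varepsilon\cos y$ is bounded below by $1 - \varepsilon > 0$, the Newton correction is always well defined for $0 < \varepsilon < 1$.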

3. Use the analytic characterization of order given in Theorem 6.4.1, in conjunction 
with Maple’s series expansion capabilities, to: 


(a) determine the coefficients $\{\beta_{k,s}\}_{s=0}^{k-1}$ of the kth-order Adams–Bashforth
method (6.48) for k = 1 : 10;

(b) determine the coefficients $\{\beta^*_{k,s}\}_{s=1}^{k}$ of the kth-order Adams–Moulton
method (6.56) for k = 1 : 10.
4. (a) Write a Matlab routine for plotting the regions $\mathcal{D}_A$ of absolute stability
for the kth-order Adams–Moulton methods, k = 3 : 10. {Hint: seek the
boundaries of the regions $\mathcal{D}_A$ in polar coordinates.} In particular, compute
the abscissae of absolute stability on the negative real axis.

(b) Do the same for the Adams (PECE) predictor/corrector method. Compare 
the stability properties of this predictor/corrector method with those of the 
corrector alone. 

5. Consider the (slightly modified) model problem

$$\frac{dy}{dx} = -\omega[y - a(x)], \quad 0 \le x \le 1; \qquad y(0) = y_0,$$

where ω > 0 and (i) $a(x) = x^2$, $y_0 = 0$; (ii) $a(x) = e^{-x}$, $y_0 = 1$; (iii) $a(x) = e^{x}$, $y_0 = 1$.

(a) In each of the cases (i)–(iii), obtain the exact solution y(x).

(b) In each of the cases (i)–(iii), apply the kth-order Adams predictor/corrector
method, for k = 2 : 5, using exact starting values and step lengths h = [...].
Print the exact values $y_n$ and the errors $u_n - y_n$ for $x_n$ =
0.25, 0.5, 0.75, 1. Try ω = 1, ω = 10, and ω = 50. Summarize your results.

(c) Repeat (b), but using the k-step backward differentiation method (6.147) (in
Lagrangian form).


6. Consider the nonlinear system

$$\frac{dy_1}{dx} = 2y_1(1 - y_2), \qquad y_1(0) = 1,$$
$$\frac{dy_2}{dx} = -y_2(1 - y_1), \qquad y_2(0) = 3,$$

for $0 \le x \le 10$, of interest in population dynamics.

(a) Use Matlab’s ode45 routine to plot the solution $y = [y_1, y_2]$ of the system
to get an idea of its behavior. Also plot the norm of the Jacobian matrix $f_y$
along the solution curve to check on the stiffness of the system.

(b) Determine a step length h, or the corresponding number N of steps, in
the classical Runge–Kutta method that would produce about eight correct
decimal digits. {Hint: for N = 10, 20, 40, 80, ... compute the solution with
N steps and 2N steps and stop as soon as the two solutions agree to within
eight decimal places at all grid points common to both solutions. For the
basic Runge–Kutta step, use the routine RK4 from Chap. 5, MA 1(a).}

(c) Apply N = 640 steps of the pair of fourth-order methods constructed
in Ex. 9 to obtain asymptotically upper and lower bounds to the solution.
Plot suitably scaled errors $u_n - y_n$, $u_n^* - y_n$, n = 1(1)N, where $y_n$ is
the solution computed in (b) by the Runge–Kutta method. For the required
initial approximations, use the classical Runge–Kutta method. Use Newton’s
method to solve the implicit equation for $u^*$.
Selected Solutions to Exercises


2. Write the system of equations more simply as

$$u = \varphi(u), \qquad \varphi(u) = h\beta_k f(x_{n+k}, u) + g_n,$$

where $g_n$ is a constant vector not depending on u, and denote the solution
by $u^*$.


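(a) Newton’s method applied to $u - \varphi(u) = 0$ can be written as

$$u^{[\nu+1]} = u^{[\nu]} - \left(I - \varphi_u(u^{[\nu]})\right)^{-1}\left(u^{[\nu]} - \varphi(u^{[\nu]})\right), \qquad \nu = 0, 1, 2, \ldots,$$

where $\varphi_u$ is the Jacobian of φ. Multiplying through by $I - \varphi_u(u^{[\nu]})$ and
rearranging gives

$$u^{[\nu+1]} = \varphi(u^{[\nu]}) + \varphi_u(u^{[\nu]})\left(u^{[\nu+1]} - u^{[\nu]}\right) \quad \text{(Newton)}.$$

Likewise, for modified Newton,

$$u^{[\nu+1]} = \varphi(u^{[\nu]}) + \varphi_u(u^{[0]})\left(u^{[\nu+1]} - u^{[\nu]}\right) \quad \text{(modified Newton)}.$$

Successive approximations, finally, gives simply

$$u^{[\nu+1]} = \varphi(u^{[\nu]}) \quad \text{(successive approximations)}.$$

Now, by Taylor expansion at $u^*$ (cf. Chap. 5, Sect. 5.6.4, (5.58)), we have

$$\varphi(u^{[\nu]}) = \varphi(u^*) + \varphi_u(u^*)\varepsilon_\nu + \tfrac12\varepsilon_\nu^T\varphi_{uu}(\bar u)\varepsilon_\nu,$$
$$\varphi_u(u^{[\nu]}) = \varphi_u(u^*) + \sum_k \varepsilon_\nu^k\,\frac{\partial\varphi_u}{\partial u^k}(\bar u),$$

where

$$\varepsilon_\nu = u^{[\nu]} - u^*$$

and $\varepsilon_\nu^k$ is the kth component of $\varepsilon_\nu$.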
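Consider first Newton’s method. Write it in terms of the $\varepsilon_\nu$ as

$$u^* + \varepsilon_{\nu+1} = \varphi(u^*) + \varphi_u(u^*)\varepsilon_\nu + \tfrac12\varepsilon_\nu^T\varphi_{uu}(\bar u)\varepsilon_\nu + \Big(\varphi_u(u^*) + \sum_k \varepsilon_\nu^k\,\frac{\partial\varphi_u}{\partial u^k}(\bar u)\Big)(\varepsilon_{\nu+1} - \varepsilon_\nu),$$

or, simplifying, noting that $u^* = \varphi(u^*)$,

$$\Big(I - \varphi_u(u^*) - \sum_k \varepsilon_\nu^k\,\frac{\partial\varphi_u}{\partial u^k}(\bar u)\Big)\varepsilon_{\nu+1} = \tfrac12\varepsilon_\nu^T\varphi_{uu}(\bar u)\varepsilon_\nu - \sum_k \varepsilon_\nu^k\,\frac{\partial\varphi_u}{\partial u^k}(\bar u)\,\varepsilon_\nu.$$

Now recall from the definition of φ, and our smoothness assumptions, that
$\varphi_u(u) = O(h)$ and also $\varphi_{uu}(u) = O(h)$. Suppose $\varepsilon_\nu = O(h^{r_\nu})$. By
assumption, $r_0 = g > 0$, and from the relation above we see that the
left-hand side is $O(h^{r_{\nu+1}})$, while the right-hand side is $O(h^{2r_\nu+1})$, so that

$$r_{\nu+1} = 2r_\nu + 1.$$

The solution of this difference equation, with starting value $r_0 = g$, is
readily seen to be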
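$$r_\nu = 2^\nu(g + 1) - 1 \quad \text{(Newton)}.$$

For the modified Newton’s method, one gets similarly

$$\left(I - \varphi_u(u^{[0]})\right)\varepsilon_{\nu+1} = \tfrac12\varepsilon_\nu^T\varphi_{uu}(\bar u)\varepsilon_\nu - \sum_k \varepsilon_0^k\,\frac{\partial\varphi_u}{\partial u^k}(\bar u)\,\varepsilon_\nu,$$

so that

$$r_{\nu+1} = \min(2r_\nu + 1,\ g + r_\nu + 1), \qquad r_0 = g.$$

For ν = 0 this gives $r_1 = 2g + 1$. Using induction on ν, suppose
$r_\nu = \nu(g + 1) + g$ for some ν ≥ 1 (true when ν = 1). Then, since

$$\min(2r_\nu + 1,\ g + r_\nu + 1) = \min\big((2\nu + 1)(g + 1) + g,\ (\nu + 1)(g + 1) + g\big) = (\nu + 1)(g + 1) + g,$$

we obtain $r_{\nu+1} = (\nu + 1)(g + 1) + g$, which is the induction hypothesis
with ν replaced by ν + 1. Therefore,

$$r_\nu = \nu(g + 1) + g \quad \text{(modified Newton)}.$$

Finally, in the case of fixed point iteration, from

$$\varepsilon_{\nu+1} = \varphi_u(u^*)\varepsilon_\nu + \tfrac12\varepsilon_\nu^T\varphi_{uu}(\bar u)\varepsilon_\nu,$$

one gets

$$r_{\nu+1} = r_\nu + 1, \qquad r_0 = g,$$

hence

$$r_\nu = \nu + g \quad \text{(successive approximations)}.$$

(b) If $u^{[0]} = u_{n+k-1}$, then, since $u^* = u_{n+k}$,

$$\varepsilon_0 = u^{[0]} - u^* = u_{n+k-1} - u_{n+k}.$$

From the multistep formula

$$u_{n+k} + \sum_{s=0}^{k-1} \alpha_s u_{n+s} = h\sum_{s=0}^{k} \beta_s f_{n+s} = O(h),$$

one obtains

$$\varepsilon_0 = u_{n+k-1} + \sum_{s=0}^{k-1} \alpha_s u_{n+s} + O(h),$$

which, in view of $u_{n+s} = u_n + O(h)$ and $\alpha_k = 1$, yields

$$\varepsilon_0 = u_n\sum_{s=0}^{k} \alpha_s + O(h) = O(h),$$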
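by consistency. Thus, g = 1.

(c) Solve $r_\lambda = p + 1$ and take $\mu = \lceil\lambda\rceil$ if λ is not an integer, and μ = λ + 1
otherwise. For Newton’s method, this gives (when g = 1)

$$\lambda = \frac{\log\frac{p+2}{2}}{\log 2}, \qquad \mu = \begin{cases} 2, & 2 \le p \le 5,\\ 3, & 6 \le p \le 10.\end{cases}$$

For modified Newton,

$$\lambda = \frac{p}{2}, \qquad \mu = \begin{cases} \dfrac{p}{2} + 1, & p \text{ even},\\[4pt] \dfrac{p+1}{2}, & p \text{ odd}.\end{cases}$$

Finally, for successive approximation,

$$\lambda = p, \qquad \mu = p + 1.$$

(d) If g = p, then, from the results in (a), for Newton’s method,

$$\lambda = \frac{\log\left(1 + \frac{1}{p+1}\right)}{\log 2} < 1, \qquad \mu = 1;$$

for modified Newton,

$$\lambda = \frac{1}{p+1} < 1, \qquad \mu = 1;$$

and for successive approximations,

$$\lambda = 1, \qquad \mu = 2.$$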
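The iteration counts in (c) and (d) can be cross-checked by brute force from the exponents $r_\nu$ derived in (a). The following Python sketch is illustrative only; the function names are hypothetical, and `mu` simply searches for the smallest ν with $r_\nu > p + 1$.

```python
def r_newton(nu, g):
    """Exponent r_nu for Newton's method: 2^nu (g + 1) - 1."""
    return 2 ** nu * (g + 1) - 1


def r_modified(nu, g):
    """Exponent r_nu for the modified Newton method: nu (g + 1) + g."""
    return nu * (g + 1) + g


def r_successive(nu, g):
    """Exponent r_nu for successive approximations: nu + g."""
    return nu + g


def mu(r, p, g=1):
    """Smallest integer nu such that r(nu, g) > p + 1."""
    nu = 0
    while r(nu, g) <= p + 1:
        nu += 1
    return nu


# Tabulate mu for p = 2, ..., 10 with g = 1, as asked for in part (c).
for p in range(2, 11):
    print(p, mu(r_newton, p), mu(r_modified, p), mu(r_successive, p))
```

Calling `mu(r, p, g=p)` for each of the three exponent functions gives the counts asked for in part (d).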


10. (a) From (6.148), after some calculation or with the help of Maple, one finds
the following characteristic polynomials:

$$\begin{aligned}
\alpha(t) &= t - 1 && \text{for } k = 1,\\
&= t^2 - \tfrac{4}{3}t + \tfrac{1}{3} && \text{for } k = 2,\\
&= t^3 - \tfrac{18}{11}t^2 + \tfrac{9}{11}t - \tfrac{2}{11} && \text{for } k = 3,\\
&= t^4 - \tfrac{48}{25}t^3 + \tfrac{36}{25}t^2 - \tfrac{16}{25}t + \tfrac{3}{25} && \text{for } k = 4,\\
&= t^5 - \tfrac{300}{137}t^4 + \tfrac{300}{137}t^3 - \tfrac{200}{137}t^2 + \tfrac{75}{137}t - \tfrac{12}{137} && \text{for } k = 5,\\
&= t^6 - \tfrac{120}{49}t^5 + \tfrac{150}{49}t^4 - \tfrac{400}{147}t^3 + \tfrac{75}{49}t^2 - \tfrac{24}{49}t + \tfrac{10}{147} && \text{for } k = 6,\\
&= t^7 - \tfrac{980}{363}t^6 + \tfrac{490}{121}t^5 - \tfrac{4900}{1089}t^4 + \tfrac{1225}{363}t^3 - \tfrac{196}{121}t^2 + \tfrac{490}{1089}t - \tfrac{20}{363} && \text{for } k = 7.
\end{aligned}$$

Matlab gives for the roots:

k   roots ζ_i, i = 1, 2, ..., k                                   max_{ζ_i ≠ 1} |ζ_i|
1   1                                                             –
2   1, .3333                                                      .3333
3   1, .3182 ± .2839i                                             .4264
4   1, .2693 ± .4920i, .3815                                      .5609
5   1, .2100 ± .6769i, .3848 ± .1621i                             .7087
6   1, .1453 ± .8511i, .3762 ± .2885i, .4061                      .8634
7   1, .0768 ± 1.0193i, .3628 ± .3988i, .4103 ± .1143i            1.0222


It is seen that the root condition is satisfied for 1 ≤ k ≤ 6, but not for
k = 7.
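The polynomial coefficients listed above can be reproduced exactly in rational arithmetic. The Python sketch below is illustrative and not the book’s Maple/Matlab computation; it assumes the backward differentiation method in the form $\sum_{j=1}^{k} \frac{1}{j}\nabla^j u_{n+k} = h f_{n+k}$, so that α(t) is proportional to $\sum_{j=1}^{k} \frac{1}{j}\,t^{k-j}(t-1)^j$, and the function names are hypothetical.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, highest power first."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def bdf_alpha(k):
    """Coefficients of alpha(t), highest power first, normalized so alpha_k = 1."""
    total = [Fraction(0)] * (k + 1)
    for j in range(1, k + 1):
        term = [Fraction(1, j)]
        for _ in range(j):
            term = poly_mul(term, [Fraction(1), Fraction(-1)])  # (t - 1)^j / j
        term = term + [Fraction(0)] * (k - j)                   # times t^(k - j)
        total = [a + b for a, b in zip(total, term)]
    return [c / total[0] for c in total]


def deflate_at_one(alpha):
    """Quotient of alpha(t) by (t - 1), via synthetic division (remainder is 0)."""
    q = [alpha[0]]
    for c in alpha[1:-1]:
        q.append(c + q[-1])
    return q
```

For k = 3, `deflate_at_one(bdf_alpha(3))` yields $t^2 - \tfrac{7}{11}t + \tfrac{2}{11}$, whose complex-conjugate zeros have modulus $\sqrt{2/11} \approx .4264$, in agreement with the table. The zeros for larger k would still have to be computed numerically, as was done with Matlab in the text.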


(b) From the discussion following (6.143), the local error constant €,+; is the 
coefficient of ¢* in the expansion of ae Lye (e® = (1 —£)7!), hence 


leap = — dest. The global error constant, therefore, is 
dk k+1 
Ceti =—-——- 
ad 


In the particular case of (6.146), one gets 


k 
Ve+1 1 1 
Ceti = =— , a=) -. 
aR (k + la; 2 r 


For Adams—Moulton (cf. (6.60)), one has 


* 


ial — 
Cert = Veo 


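since $\sum_s \beta_s = 1$ by consistency. Below is a table of $a_k$ and of the two global error
constants (backward differentiation and Adams–Moulton) for k = 1 : 7.

k   a_k        backward differentiation      Adams–Moulton
1   1          −1/2 = −.500                  −1/2 = −.500
2   3/2        −4/27 = −.148                 −1/12 = −.0833
3   11/6       −9/121 = −.0744               −1/24 = −.0417
4   25/12      −144/3125 = −.0461            −19/720 = −.0264
5   137/60     −600/18769 = −.0320           −3/160 = −.0188
6   49/20      −400/16807 = −.0238           −863/60480 = −.0143
7   363/140    −2450/131769 = −.0186         −275/24192 = −.0114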


Both error constants are negative for all k, those for Adams–Moulton
consistently smaller in absolute value (for 2 ≤ k ≤ 7) than those for
backward differentiation, by a factor of about 0.6.


Selected Solutions to Machine Assignments


6. (a)
PROGRAMS

%MAVI_6A
%
y0=[1;3]; xspan=[0 10];
[x,y]=ode45(@fMAVI_6,xspan,y0);
figure
plot(x,y)
sy=size(y,1);
normJ=zeros(sy,1);
for i=1:sy
  z=[y(i,1);y(i,2)];
  % infinity norm of the Jacobian at the ith point of the solution
  normJ(i)=norm(JfMAVI_6(x(i),z),inf);
end
figure
plot(x,normJ)

%fMAVI_6  Differential equation for MAVI_6
%
function yprime=fMAVI_6(x,y)
yprime=[2*y(1)*(1-y(2));-y(2)*(1-y(1))];

%JfMAVI_6  Jacobian matrix of fMAVI_6
%
function J=JfMAVI_6(x,y)
J=[2*(1-y(2)) -2*y(1);y(2) -(1-y(1))];
[Plots: the solution components y1, y2 (left) and the norm of the Jacobian along the solution (right)]

The graph on the right shows that the differential equation is definitely
nonstiff.
(b) For the Matlab program pertaining to this part (b), see the beginning of the
program shown in part (c).

OUTPUT
>> MAVI_6BC
  h=0.00390625 N=2560
>>


(c) The pair of fourth-order methods is (cf. Ex. 9)

$$u_{n+4} = u_{n+3} + h\sum_{s=0}^{3} \beta_s f_{n+s},$$
$$u^*_{n+4} = u^*_{n+3} + h\sum_{s=0}^{4} \beta^*_s f^*_{n+s},$$

where $f_{n+s} = f(x_{n+s}, u_{n+s})$, $f^*_{n+s} = f(x_{n+s}, u^*_{n+s})$, and the coefficients
$\beta_s$, $\beta^*_s$ are given by

s    24β_s    360β*_s
0     −9       116
1     37      −449
2    −59       621
3     55      −179
4              251

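To compute $u^*_{n+4}$ in the implicit method, we write the latter in the form

$$F(u^*_{n+4}) := u^*_{n+4} - h\beta^*_4 f(x_{n+4}, u^*_{n+4}) - g = 0, \qquad g = u^*_{n+3} + h\sum_{s=0}^{3} \beta^*_s f^*_{n+s},$$

and apply to it Newton’s method,

$$J_F(u^{[i]})\,\Delta_i = u^{[i]} - h\beta^*_4 f(x_{n+4}, u^{[i]}) - g,$$
$$u^{[i+1]} = u^{[i]} - \Delta_i,$$

where

$$J_F(u) = I - h\beta^*_4 f_y(x_{n+4}, u), \qquad f_y(y) = \begin{bmatrix} 2(1 - y_2) & -2y_1 \\ y_2 & -(1 - y_1) \end{bmatrix}.$$

If we take $u^{[0]} = u_{n+4}$ (the approximation obtained from the explicit
formula), we find that never more than three iterations (mostly 2) are required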
to iterate to an accuracy of about eight decimal digits.

PROGRAM

%MAVI_6BC
%
% Part (b)
%
N=10; Y2=zeros(2,N); maxerr=1;
while maxerr>.5e-8
  N=2*N; h=10/N; Y=Y2;
  err=zeros(1,N/2);
  y1=[1;3];
  for n=1:N
    x=(n-1)*h;
    y=y1;
    y1=RK4(@fMAVI_6,x,y,h);
    Y2(:,n)=y1;
    % compare with the previous (coarser) run at the common grid points
    if floor(n/2)==n/2
      err(n/2)=norm(Y(:,n/2)-Y2(:,n),inf);
    end
  end
  maxerr=max(err);
end
fprintf('  h=%10.8f N=%3.0f\n',2*h,N/2)


% Part (c)
%
g=zeros(2,1); err=zeros(2,1);
N=640;
h=10/N; fac=5120/N; U=zeros(2,N);
Ustar=zeros(2,N);
F=zeros(2,N); Fstar=zeros(2,N); u0=[1;3];
% starting values by the classical Runge-Kutta method
U(:,1)=RK4(@fMAVI_6,0,u0,h); F(:,1)=fMAVI_6(h,U(:,1));
Ustar(:,1)=U(:,1); Fstar(:,1)=F(:,1);
for s=2:3
  % the system is autonomous, so the x-argument does not affect the result
  U(:,s)=RK4(@fMAVI_6,(s-1)*h,U(:,s-1),h);
  F(:,s)=fMAVI_6(s*h,U(:,s));
  Ustar(:,s)=U(:,s); Fstar(:,s)=F(:,s);
end
% first step of the explicit method
U(:,4)=U(:,3)+(h/24)*(55*F(:,3)-59*F(:,2)+37*F(:,1) ...
  -9*fMAVI_6(0,u0));
F(:,4)=fMAVI_6(4*h,U(:,4));
% first step of the implicit method (Newton iteration)
g=Ustar(:,3)+(h/360)*(-179*Fstar(:,3)+621*Fstar(:,2) ...
  -449*Fstar(:,1)+116*fMAVI_6(0,u0));
u4new=U(:,4); d=[1;1];
while norm(d,inf)>.5e-8
  u4=u4new;
  J=eye(2)-(251/360)*h*JfMAVI_6(4*h,u4);
  d=J\(u4-(251/360)*h*fMAVI_6(4*h,u4)-g);
  u4new=u4-d;
end
Ustar(:,4)=u4new; Fstar(:,4)=fMAVI_6(4*h,Ustar(:,4));
for n=5:N
  U(:,n)=U(:,n-1)+(h/24)*(55*F(:,n-1)-59*F(:,n-2) ...
    +37*F(:,n-3)-9*F(:,n-4));
  F(:,n)=fMAVI_6(n*h,U(:,n));
  g=Ustar(:,n-1)+(h/360)*(-179*Fstar(:,n-1)+621*Fstar(:,n-2) ...
    -449*Fstar(:,n-3)+116*Fstar(:,n-4));
  unnew=U(:,n); d=[1;1];
  while norm(d,inf)>.5e-8
    un=unnew;
    J=eye(2)-(251/360)*h*JfMAVI_6(n*h,un);
    d=J\(un-(251/360)*h*fMAVI_6(n*h,un)-g);
    unnew=un-d;
  end
  Ustar(:,n)=unnew; Fstar(:,n)=fMAVI_6(n*h,Ustar(:,n));
end
% scaled errors relative to the Runge-Kutta solution Y2 of part (b)
f1N=fac*(1:N); xp=h*(1:N)';
erru1=h^(-4)*(U(1,:)-Y2(1,f1N))';
erru1star=h^(-4)*(Ustar(1,:)-Y2(1,f1N))';
erru2=h^(-4)*(U(2,:)-Y2(2,f1N))';
erru2star=h^(-4)*(Ustar(2,:)-Y2(2,f1N))';
figure
hold on
plot(xp,erru1)
plot(xp,erru1star)
plot([0 10],[0 0])
text(2.5,400,'h^{-4} erru1','FontSize',14)
text(2.5,-400,'h^{-4} erru1^*','FontSize',14)
hold off
figure
hold on
plot(xp,erru2)
plot(xp,erru2star)
plot([0 10],[0 0])
text(3.2,500,'h^{-4} erru2','FontSize',14)
text(3.2,-500,'h^{-4} erru2^*','FontSize',14)
hold off


PLOTS (for N=640)

[Plots: scaled errors h^{-4} erru1 and h^{-4} erru1* (left); h^{-4} erru2 and h^{-4} erru2* (right)]


The graphs on the left show the first components $h^{-4}(u_n^1 - y_n^1)$, $h^{-4}(u_n^{*1} - y_n^1)$
of the scaled errors, those on the right the second components $h^{-4}(u_n^2 - y_n^2)$,
$h^{-4}(u_n^{*2} - y_n^2)$. They are approximately proportional to the respective global
error constants $C_{4,4}$ and $C^*_{4,4}$ of the explicit and implicit method, where
$C^*_{4,4} = -C_{4,4}$ by construction. Thus the graphs are essentially symmetric
with respect to the real axis. It can also be seen that the error bounds 
switch their roles from upper to lower bounds, and vice versa, several 
times during the integration (three times for the first component, four times 
for the second component). Evidently, these are the places where the first 
resp. second component of the solution e(x) of the variational differential 
equation crosses the real axis (cf. Theorem 6.3.5, (6.107)). 


Chapter 7 
Two-Point Boundary Value Problems for ODEs 


Many problems in applied mathematics require solutions of differential equations 
specified by conditions at more than one point of the independent variable. 
These are called boundary value problems; they are considerably more difficult 
to deal with than initial value problems, largely because of their global nature. 
Unlike (local) existence and uniqueness theorems known for initial value problems 
(cf. Theorem 5.3.1), there are no comparably general theorems for boundary value 
problems. Neither existence nor uniqueness is, in general, guaranteed. 

We concentrate here on two-point boundary value problems, in which the system
of differential equations

$$\frac{dy}{dx} = f(x, y), \qquad f: [a,b] \times \mathbb{R}^d \to \mathbb{R}^d, \quad d \ge 2, \tag{7.1}$$

is supplemented by conditions at the two endpoints a and b. In the most general
case they take the form

$$g(y(a), y(b)) = 0, \tag{7.2}$$

where g is a nonlinear mapping $g: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$. Often, however, they are linear
and even of the very special kind in which some components of y are prescribed at 
one endpoint, and some (other or the same) components at the other endpoint, the 
total number of conditions being equal to the dimension d of the system. 

There are other important problems, such as eigenvalue problems and problems 
with free boundary, that can be transformed to two-point boundary value problems 
and, therefore, also solved numerically in this manner. 

An eigenvalue problem is an overdetermined problem containing a parameter
λ, say,

$$\frac{dy}{dx} = f(x, y; \lambda), \qquad a \le x \le b, \tag{7.3}$$

with f as in (7.1), but depending on an additional scalar parameter λ, and d + 1
boundary conditions (instead of d) of the form

$$g(y(a), y(b); \lambda) = 0, \qquad g: \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^{d+1}. \tag{7.4}$$


The best-known example of an eigenvalue problem is the Sturm–Liouville
problem, where one seeks a nontrivial solution u of

$$\frac{d}{dx}\left(p(x)\frac{du}{dx}\right) + q(x)u = \lambda u, \qquad a \le x \le b, \tag{7.5}$$

subject to homogeneous boundary conditions

$$u(a) = 0, \qquad u(b) = 0. \tag{7.6}$$

Since with any solution of (7.5) and (7.6) every constant multiple of it is also a
solution, we may specify, in addition to (7.6), that (for example)

$$u'(a) = 1, \tag{7.7}$$

and in this way also make sure that $u \not\equiv 0$. The problem then becomes of the form
(7.3), (7.4), if (7.5) is written as a system of two first-order equations (cf. (5.18)–
(5.20) of Chap. 5). In this particular case, f in (7.3) is linear homogeneous, and g
in (7.4) is also linear and independent of λ. Normally, (7.5) will have no solution
satisfying all three boundary conditions (7.6) and (7.7), except for special values of
λ; these are called eigenvalues of the problem (7.5)–(7.7). Similarly, there will be
exceptional values of λ, again called eigenvalues, for which (7.3) and (7.4) admit
a solution.

To write (7.3) and (7.4) as a two-point boundary value problem, we introduce an
additional component and associated (trivial) differential equation,

$$y^{d+1} = \lambda, \qquad \frac{dy^{d+1}}{dx} = 0,$$

and simply adjoin this to (7.3). That is, we let

$$Y = \begin{bmatrix} y \\ y^{d+1} \end{bmatrix} \in \mathbb{R}^{d+1}$$

and write (7.3) and (7.4) in the form

$$\frac{dY}{dx} = F(x, Y), \quad a \le x \le b; \qquad G(Y(a), Y(b)) = 0, \tag{7.8}$$

where

$$F(x, Y) = \begin{bmatrix} f(x, y; y^{d+1}) \\ 0 \end{bmatrix}, \qquad G(Y(a), Y(b)) = g(y(a), y(b); y^{d+1}(a)). \tag{7.9}$$

Thus, for example, the Sturm–Liouville problem (7.5) and (7.6), with

$$y_1 = u, \qquad y_2 = p(x)\frac{du}{dx}, \qquad y_3 = \lambda,$$

becomes the standard two-point boundary value problem

$$\begin{aligned}
\frac{dy_1}{dx} &= \frac{1}{p(x)}\,y_2,\\
\frac{dy_2}{dx} &= -q(x)y_1 + y_1y_3,\\
\frac{dy_3}{dx} &= 0,
\end{aligned} \tag{7.10}$$

subject to

$$y_1(a) = 0, \qquad y_1(b) = 0, \qquad y_2(a) = p(a). \tag{7.11}$$

If one of the boundary points, say, b, is unknown, then the problem (7.1) and
(7.2), where now $g: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{d+1}$, is a problem with free boundary. This too
can be reduced to an ordinary two-point boundary value problem if one sets

$$z^{d+1} = b - a,$$

which is a constant as far as dependence on x is concerned, and introduces a new
independent variable t by

$$x = a + t z^{d+1}, \qquad 0 \le t \le 1.$$

Letting then

$$z(t) = y(a + t z^{d+1}), \qquad Z(t) = \begin{bmatrix} z(t) \\ z^{d+1} \end{bmatrix}$$

gives

$$\frac{dZ}{dt} = \begin{bmatrix} z^{d+1} f(a + t z^{d+1}, z) \\ 0 \end{bmatrix}, \quad 0 \le t \le 1; \qquad g(z(0), z(1)) = 0, \tag{7.12}$$

a two-point boundary value problem on a fixed interval, [0,1]. Once it is solved, one
recovers $b = a + z^{d+1}$ and $y(x) = z((x - a)/z^{d+1})$, $a \le x \le b$.

We begin with the problem of existence and uniqueness, both for linear and for 
nonlinear boundary value problems. We then show how initial value techniques 
can be employed to solve boundary value problems and discuss some of the 
practical difficulties associated with that. Approaches relying more on systems of 
linear or nonlinear equations are those based on finite difference or variational 
approximations, and we give a brief account of these as well. 


7.1 Existence and Uniqueness 


Before dealing with general considerations on existence and uniqueness (or 
nonuniqueness) of solutions to boundary value problems, it may be useful to 
look at some very simple but instructive examples. 


7.1.1 Examples 


For linear problems the breakdown of uniqueness or existence is exceptional and 
occurs, if at all, only for some “critical” intervals, often denumerably infinite in 
number. For nonlinear problems, the situation can be more complex. 


Example. $y'' - y = 0$, y(0) = 0, y(b) = β.

The general solution here is made up of the hyperbolic cosine and sine. Since the
hyperbolic cosine is ruled out by the first boundary condition, one obtains from the
second boundary condition uniquely

$$y(x) = \beta\,\frac{\sinh x}{\sinh b}, \qquad 0 \le x \le b. \tag{7.13}$$

There are no exceptional (critical) intervals here.


Example. $y'' + y = 0$, y(0) = 0, y(b) = β.

Although this problem differs only slightly from the one in the previous Example,
the structure of the solution is fundamentally different because of the oscillatory
nature of the general solution, consisting of the trigonometric cosine and sine. If b
is not an integer multiple of π, there is a unique solution as before,

$$y(x) = \beta\,\frac{\sin x}{\sin b}, \qquad b \ne n\pi \quad (n = 1, 2, 3, \ldots). \tag{7.14}$$

If, however, b = nπ, then there are infinitely many solutions, or none, accordingly
as β = 0 or β ≠ 0. In the former case, all solutions have the form y(x) = c sin x,
with c an arbitrary constant. In the latter case, the second boundary condition cannot
be satisfied since every solution candidate must necessarily vanish at b. In either
case, b = nπ is a critical point.

We now minimally modify these examples to make them nonlinear. 


Example. $y'' + |y| = 0$, y(0) = 0, y(b) = β.

As a preliminary consideration, suppose $y(x_0) = 0$ for a solution of the
differential equation. What can we say about y(x) for $x > x_0$?

We distinguish three cases: (1) $y'(x_0) < 0$. In this case, y becomes
negative to the right of $x_0$, and hence the first Example applies: the solution
becomes, and remains, a negative hyperbolic sine, $y(x) = c\sinh(x - x_0)$,
c < 0. (2) $y'(x_0) = 0$. By the uniqueness of the initial value problem (the differential
equation satisfies a uniform Lipschitz condition with Lipschitz constant 1), we get
$y(x) \equiv 0$ for $x > x_0$. (3) $y'(x_0) > 0$. Now y is positive on a right neighborhood of
$x_0$, so that by the second Example, $y(x) = c\sin(x - x_0)$, c > 0, for $x_0 \le x \le x_0 + \pi$.
At $x = x_0 + \pi$, however, $y(x_0 + \pi) = 0$, $y'(x_0 + \pi) = -c < 0$, and by what was
said in Case (1), from then on we have $y(x) = -c\sinh(x - x_0 - \pi)$ (which ensures
continuity of the first derivative of y at $x = x_0 + \pi$). Thus, in this third case, the
solution y(x), $x > x_0$, consists of two arcs, a trigonometric sine arc followed by a
hyperbolic sine arc.

To discuss the solution of the boundary value problem at hand, we distinguish 
again three cases. In each case (and subcase) one arrives at the solution by 
considering all three possibilities y’(0) < 0, y’(0) = 0, y’(0) > 0, and eliminating 
all but one. 


Case I: b < π. Here we have the unique solution

$$y(x) = \begin{cases} 0 & \text{if } \beta = 0,\\[2pt] \beta\,\dfrac{\sin x}{\sin b} & \text{if } \beta > 0,\\[6pt] \beta\,\dfrac{\sinh x}{\sinh b} & \text{if } \beta < 0. \end{cases} \tag{7.15}$$

Case II: b = π.

(a) β = 0: infinitely many solutions y(x) = c sin x, c ≥ 0 arbitrary.
(b) β > 0: no solution.
(c) β < 0: unique solution $y(x) = \beta\,\dfrac{\sinh x}{\sinh \pi}$.

Case III: b > π.

(a) β = 0: unique solution y(x) ≡ 0.
(b) β > 0: no solution.
(c) β < 0: exactly two solutions,

$$y_1(x) = \beta\,\frac{\sinh x}{\sinh b}, \qquad 0 \le x \le b, \tag{7.16}$$

$$y_2(x) = \begin{cases} -\beta\,\dfrac{\sin x}{\sinh(b - \pi)}, & 0 \le x \le \pi,\\[8pt] \beta\,\dfrac{\sinh(x - \pi)}{\sinh(b - \pi)}, & \pi \le x \le b. \end{cases} \tag{7.17}$$

In summary, we indicate the number of solutions in Table 7.1. It is rather remarkable
how the seemingly innocuous modification of changing y to |y| produces such a
profound change in the qualitative behavior of the solution.

Table 7.1 Number of solutions of the
boundary value problem in Example 7.3

b      β > 0    β = 0    β < 0
< π    1        1        1
= π    0        ∞        1
> π    0        1        2
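The two-arc solution (7.17) can be checked numerically. The Python sketch below is illustrative only; the function names and the sample values b = 5, β = −2 are assumptions. It evaluates $y_2$ and estimates the residual $y'' + |y|$ by central differences on each arc, away from the junction x = π.

```python
import math


def y2_two_arc(x, b, beta):
    """Two-arc solution (7.17): sine arc on [0, pi], hyperbolic sine arc on [pi, b]."""
    s = math.sinh(b - math.pi)
    if x <= math.pi:
        return -beta * math.sin(x) / s
    return beta * math.sinh(x - math.pi) / s


def residual(x, b, beta, h=1e-4):
    """Central-difference estimate of y'' + |y| at x (use away from x = pi)."""
    ypp = (y2_two_arc(x + h, b, beta) - 2.0 * y2_two_arc(x, b, beta)
           + y2_two_arc(x - h, b, beta)) / h ** 2
    return ypp + abs(y2_two_arc(x, b, beta))


b, beta = 5.0, -2.0  # sample data with b > pi and beta < 0 (Case III(c))
print(y2_two_arc(0.0, b, beta), y2_two_arc(b, b, beta))
print(residual(1.0, b, beta), residual(4.0, b, beta))
```

Both boundary conditions are reproduced, and the residuals are zero up to the finite-difference and rounding error of the estimate; the one-sided slopes at x = π agree as well, which is the continuity of y′ mentioned in the preliminary discussion.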


7.1.2 A Scalar Boundary Value Problem 


A problem of some importance is the two-point boundary value problem for a scalar
nonlinear second-order differential equation

$$y'' = f(x, y, y'), \qquad a \le x \le b, \tag{7.18}$$

with linear boundary conditions

$$a_0y(a) - a_1y'(a) = \alpha, \qquad b_0y(b) + b_1y'(b) = \beta, \tag{7.19}$$

where we assume, of course, that not both $a_0$ and $a_1$ are zero, and similarly for
$b_0$ and $b_1$. We further assume that f is continuous on $[a,b] \times \mathbb{R} \times \mathbb{R}$ and satisfies
uniform Lipschitz conditions

$$|f(x, u_1, u_2) - f(x, u_1^*, u_2)| \le L_1|u_1 - u_1^*|,$$
$$|f(x, u_1, u_2) - f(x, u_1, u_2^*)| \le L_2|u_2 - u_2^*| \tag{7.20}$$

for all $x \in [a,b]$ and all real $u_1$, $u_2$, $u_1^*$, $u_2^*$. These assumptions are sufficient to
ensure that each initial value problem for (7.18) has a unique solution on the whole
interval [a, b] (cf. Theorem 5.3.1).

We associate with (7.18) and (7.19) the initial value problem

$$u'' = f(x, u, u'), \qquad a \le x \le b, \tag{7.21}$$

subject to

$$a_0u(a) - a_1u'(a) = \alpha, \qquad c_0u(a) - c_1u'(a) = s. \tag{7.22}$$

For the two initial conditions in (7.22) to be linearly independent, we must
assume that

$$\det\begin{bmatrix} a_0 & -a_1 \\ c_0 & -c_1 \end{bmatrix} \ne 0; \quad \text{that is,} \quad c_0a_1 - c_1a_0 \ne 0.$$

Since we are otherwise free to choose the constants $c_0$, $c_1$ as we please, we may as
well take them to satisfy

$$c_0a_1 - c_1a_0 = 1. \tag{7.23}$$

Then the initial conditions become

$$u(a) = a_1s - c_1\alpha, \qquad u'(a) = a_0s - c_0\alpha. \tag{7.24}$$

We consider $c_0$, $c_1$ to be fixed from now on, and s a parameter to be determined.
The solution of the initial value problem (7.21) and (7.24) is denoted by u(x; s).
If it is to solve the boundary value problem (7.18) and (7.19), we must have

$$\phi(s) = 0, \qquad \phi(s) := b_0u(b; s) + b_1u'(b; s) - \beta. \tag{7.25}$$

Here and in the following, the prime in u(x; s) always indicates differentiation with
respect to the first variable, x. Clearly, (7.25) is a nonlinear equation in the unknown
s (cf. Chap. 4).


Theorem 7.1.1. The boundary value problem (7.18) and (7.19) has as many
distinct solutions as φ(s) has distinct zeros.


Proof. (a) If $\phi(s_1) = 0$, then clearly $u(x; s_1)$ is a solution of the boundary value
problem (7.18) and (7.19). If $s_2 \ne s_1$ is another zero of φ(s), then by (7.24)
either $u(a; s_2) \ne u(a; s_1)$ (if $a_1 \ne 0$) or $u'(a; s_2) \ne u'(a; s_1)$ (if $a_0 \ne 0$); that
is, $u(x; s_2) \not\equiv u(x; s_1)$. Thus, to two distinct zeros of φ(s) there correspond two
distinct solutions of (7.18) and (7.19).

(b) If y(x) is a solution of the boundary value problem (7.18) and (7.19), then
defining $s := c_0y(a) - c_1y'(a)$, we have that y(x) = u(x; s); hence φ(s) = 0.
Thus, to every solution of (7.18) and (7.19) there corresponds a zero of φ. □


Theorem 7.1.1 is the basis for solving (7.18) and (7.19) numerically. Solve
φ(s) = 0 by any of the standard methods for solving nonlinear equations. We
discuss this in more detail in Sect. 7.2.
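The idea of solving φ(s) = 0 numerically can be sketched as follows. This is a minimal illustration in Python, not the method developed in Sect. 7.2: it assumes classical Runge–Kutta integration for the initial value problem (7.21), (7.24) and the secant method for φ(s) = 0, and the function names `rk4_ivp` and `shoot`, the step count, and the starting values $s_0$, $s_1$ are choices made here for illustration.

```python
import math


def rk4_ivp(f, a, b, u0, du0, n=200):
    """Integrate u'' = f(x, u, u') from a to b with n RK4 steps; return (u(b), u'(b))."""
    h = (b - a) / n
    x, u, v = a, u0, du0
    for _ in range(n):
        k1u, k1v = v, f(x, u, v)
        k2u, k2v = v + 0.5 * h * k1v, f(x + 0.5 * h, u + 0.5 * h * k1u, v + 0.5 * h * k1v)
        k3u, k3v = v + 0.5 * h * k2v, f(x + 0.5 * h, u + 0.5 * h * k2u, v + 0.5 * h * k2v)
        k4u, k4v = v + h * k3v, f(x + h, u + h * k3u, v + h * k3v)
        u += h * (k1u + 2 * k2u + 2 * k3u + k4u) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
        x += h
    return u, v


def shoot(f, a, b, a0, a1, alpha, b0, b1, beta, c0, c1, s0=0.0, s1=1.0, tol=1e-12):
    """Find a zero s of phi(s) = b0*u(b;s) + b1*u'(b;s) - beta by the secant method."""
    def phi(s):
        # initial conditions (7.24): u(a) = a1*s - c1*alpha, u'(a) = a0*s - c0*alpha
        ub, dub = rk4_ivp(f, a, b, a1 * s - c1 * alpha, a0 * s - c0 * alpha)
        return b0 * ub + b1 * dub - beta
    f0, f1 = phi(s0), phi(s1)
    for _ in range(50):
        if abs(f1) <= tol or f1 == f0:
            break
        s0, s1 = s1, s1 - f1 * (s1 - s0) / (f1 - f0)
        f0, f1 = f1, phi(s1)
    return s1


# First Example of Sect. 7.1.1: y'' = y, y(0) = 0, y(1) = 2, with c0 = 0, c1 = -1,
# so that c0*a1 - c1*a0 = 1 as required by (7.23) and s = u'(a).
s = shoot(lambda x, u, v: u, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 2.0, 0.0, -1.0)
print(s, 2.0 / math.sinh(1.0))
```

For this linear example φ(s) is linear in s, so a single secant step already produces the zero, which agrees with the initial slope β / sinh b of the exact solution (7.13) up to the integration error.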

For a certain class of boundary value problems (7.18) and (7.19) one can show
that φ(s) = 0 has exactly one solution.


Theorem 7.1.2. Assume that

(1) $f(x, u_1, u_2)$ is continuous on $[a, b] \times \mathbb{R} \times \mathbb{R}$.
(2) Both $f_{u_1}$ and $f_{u_2}$ are continuous and satisfy

$$0 < f_{u_1}(x, u_1, u_2) \le L_1, \quad |f_{u_2}(x, u_1, u_2)| \le L_2 \quad \text{on } [a, b] \times \mathbb{R} \times \mathbb{R}.$$

(3) $a_0 a_1 \ge 0$, $b_0 b_1 \ge 0$, $|a_0| + |b_0| > 0$.

Then the boundary value problem (7.18) and (7.19) has a unique solution.

Note that Assumption (2) implies (7.20), hence unique solvability on $[a, b]$ of
initial value problems for (7.18). Assumption (3) requires that $a_0$ and $a_1$ be of the
same sign, as well as $b_0$ and $b_1$, and that not both $a_0$ and $b_0$ vanish. We may assume,
by multiplying one or both of the boundary conditions (7.19) by $-1$ if necessary, that

(3′) $a_0 \ge 0$, $a_1 \ge 0$; $b_0 \ge 0$, $b_1 \ge 0$; $a_0 + b_0 > 0$.
Proof of Theorem 7.1.2. The idea of the proof is to show that

$$\phi'(s) \ge c > 0 \quad \text{for all } s \in \mathbb{R}. \tag{7.26}$$

The function $\phi(s)$ then increases monotonically from $-\infty$ to $+\infty$, and hence
vanishes for exactly one value of $s$.

We have

$$\phi'(s) = b_0 \frac{\partial}{\partial s} u(b; s) + b_1 \frac{\partial}{\partial s} u'(b; s).$$

It is convenient to denote

$$v(x) = \frac{\partial}{\partial s} u(x; s),$$

where the dependence on $s$ is suppressed in the notation for $v$. Since differentiation
with respect to $x$ and $s$ may be interchanged under the assumptions made, we can
write

$$\phi'(s) = b_0 v(b) + b_1 v'(b). \tag{7.27}$$
Furthermore, $u(x; s)$ satisfies, identically in $s$,

$$u''(x; s) = f(x, u(x; s), u'(x; s)), \quad a \le x \le b,$$
$$u(a; s) = a_1 s - c_1 \alpha, \qquad u'(a; s) = a_0 s - c_0 \alpha,$$

from which, by differentiation with respect to $s$ and interchange with differentiation
in $x$ where necessary, one gets

$$v''(x) = f_{u_1}(x, u(x; s), u'(x; s))\, v(x) + f_{u_2}(x, u(x; s), u'(x; s))\, v'(x), \qquad v(a) = a_1, \quad v'(a) = a_0. \tag{7.28}$$

Thus, $v$ is the solution of a “linear” initial value problem,

$$v'' = p(x) v' + q(x) v, \quad a \le x \le b, \qquad v(a) = a_1, \quad v'(a) = a_0, \tag{7.29}$$

where

$$|p(x)| \le L_2, \quad 0 < q(x) \le L_1 \quad \text{on } [a, b]. \tag{7.30}$$
We are going to show that, on $a < x \le b$,

$$v(x) > a_1 + a_0\, \frac{1 - e^{-L_2 (x - a)}}{L_2}, \qquad v'(x) > a_0 e^{-L_2 (x - a)}. \tag{7.31}$$

From this, (7.26) will follow. Indeed, since not both $a_0$ and $a_1$ can vanish and by (3′)
at least one is positive, it follows from (7.31) that $v(b) > 0$. If $b_0 > 0$, then (7.27)
shows, since $b_1 \ge 0$ and $v'(b) \ge 0$ by (7.31), that $\phi'(s)$ is positive and bounded
away from 0 (as a function of $s$). The same conclusion follows if $b_0 = 0$, since then
$b_1 > 0$ and $\phi'(s) = b_1 v'(b) > 0$ in (7.27).

To prove (7.31), we first show that $v(x) > 0$ for $a < x \le b$. This is certainly true
in a small right neighborhood of $a$, since by (7.29) either $v(a) > 0$ or $v(a) = 0$ and
$v'(a) > 0$. If the assertion were false, we would therefore have $v(x_0) = 0$ for some
$x_0$ in $(a, b]$. But then $v$ must have a local maximum at some $x_1$ with $a < x_1 < x_0$.
This is clear in the cases where $v(a) = 0$, $v'(a) > 0$, and $v(a) > 0$, $v'(a) > 0$. In the
remaining case $v(a) > 0$, $v'(a) = 0$, it follows from the fact that then $v''(a) > 0$ by
virtue of the differential equation in (7.29) and the positivity of $q$ (cf. (7.30)). Thus,

$$v(x_1) > 0, \quad v'(x_1) = 0, \quad v''(x_1) \le 0.$$

But this contradicts the differential equation (7.29) at $x = x_1$, since
$q(x_1) > 0$. This establishes the positivity of $v$ on $(a, b]$. We thus have, using
again the positivity of $q$,

$$v''(x) - p(x) v'(x) > 0 \quad \text{for } a < x \le b.$$

Multiplication by the “integrating factor” $\exp\left(-\int_a^x p(t)\,dt\right)$ yields

$$\frac{d}{dx}\left[ e^{-\int_a^x p(t)\,dt}\, v'(x) \right] > 0,$$

and upon integration from $a$ to $x$,

$$e^{-\int_a^x p(t)\,dt}\, v'(x) - v'(a) > 0.$$

This, in turn, by the second initial condition in (7.29), gives

$$v'(x) > a_0 e^{\int_a^x p(t)\,dt},$$

from which the second inequality in (7.31) follows by virtue of $p(t) \ge -L_2$
(cf. (7.30)). The first inequality in (7.31) follows by integrating the second from
$a$ to $x$. $\square$


Theorem 7.1.2 has an immediate application to the Sturm–Liouville problem

$$\mathcal{L} y = r(x), \quad a \le x \le b; \qquad B_a y = \alpha, \quad B_b y = \beta, \tag{7.32}$$

where

$$\mathcal{L} y := -y'' + p(x) y' + q(x) y, \qquad B_a y = a_0 y(a) - a_1 y'(a), \quad B_b y = b_0 y(b) + b_1 y'(b). \tag{7.33}$$

Corollary 7.1.1. If $p$, $q$, and $r$ are continuous on $[a, b]$ with

$$q(x) > 0 \quad \text{for } a \le x \le b, \tag{7.34}$$

and if $a_0$, $a_1$, $b_0$, $b_1$ satisfy the condition (3) of Theorem 7.1.2, then (7.32) has a
unique solution.


We remark that the differential equation in (7.32) can be written equivalently in
“self-adjoint form” if we multiply both sides by $P(x) = \exp\left(-\int p\,dx\right)$. This yields

$$-\frac{d}{dx}\left( P(x) \frac{dy}{dx} \right) + Q(x) y = R(x), \quad a \le x \le b, \tag{7.35}$$

with

$$Q(x) = P(x) q(x), \qquad R(x) = P(x) r(x).$$

Note that $P$ in (7.35) is positive and not only continuous, but also continuously
differentiable on $[a, b]$. Furthermore, the positivity of $q$ is equivalent to the positivity
of $Q$.


The following is a result of a somewhat different nature, providing an alternative. 


Theorem 7.1.3. The boundary value problem (7.32) has a unique solution for
arbitrary $\alpha$ and $\beta$ if and only if the corresponding homogeneous problem (with
$r \equiv 0$, $\alpha = \beta = 0$) has only the trivial solution $y \equiv 0$.

Proof. Define $u_1$ and $u_2$ by

$$\mathcal{L} u_1 = r(x), \quad a \le x \le b; \qquad u_1(a) = -c_1 \alpha, \quad u_1'(a) = -c_0 \alpha,$$

and

$$\mathcal{L} u_2 = 0, \quad a \le x \le b; \qquad u_2(a) = a_1, \quad u_2'(a) = a_0,$$

with $c_0$, $c_1$ as defined in (7.23). Then one easily verifies that $B_a u_1 = \alpha$, $B_a u_2 = 0$,
so that

$$u(x) = u_1(x) + s\, u_2(x) \tag{7.36}$$

satisfies both the inhomogeneous differential equation and the first boundary
condition of (7.32). The inhomogeneous boundary value problem therefore has a
unique solution if and only if

$$B_b u_2 \ne 0, \tag{7.37}$$

so that the second boundary condition $B_b u = \beta$ can be solved uniquely for $s$ in
(7.36). On the other hand, for the homogeneous boundary value problem to have
only the trivial solution, we must have (7.37), since otherwise $\mathcal{L} u_2 = 0$, $B_a u_2 = 0$,
and $B_b u_2 = 0$, whereas one of $u_2(a)$ and $u_2'(a)$ must be different from zero, since
not both $a_1$ and $a_0$ can vanish. $\square$


7.1.3 General Linear and Nonlinear Systems 


The two-point boundary value problem for the general nonlinear system (7.1), with
linear boundary conditions, takes the form

$$\frac{dy}{dx} = f(x, y), \quad a \le x \le b, \qquad A y(a) + B y(b) = \gamma, \tag{7.38}$$

where $A$, $B$ are square matrices of order $d$ with constant elements, and $\gamma$ is a given
$d$-vector. For linear independence and consistency we assume that

$$\operatorname{rank}\,[A, B] = d. \tag{7.39}$$

In this general form, the associated initial value problem is

$$\frac{du}{dx} = f(x, u), \quad a \le x \le b, \qquad u(a) = s, \tag{7.40}$$

where $s \in \mathbb{R}^d$ is a “trial” initial vector. If we denote by $u(x; s)$ the solution of (7.40)
and assume it to exist on $[a, b]$, then (7.38) is equivalent to a problem of solving a
nonlinear system of equations,

$$\phi(s) = 0, \qquad \phi(s) = A s + B u(b; s) - \gamma. \tag{7.41}$$

Again, the boundary value problem (7.38) has as many distinct solutions as the
nonlinear system (7.41) has distinct solution vectors. By imposing sufficiently
strong (but often unrealistic) conditions on $f$, $A$, and $B$, it is possible to prove
that (7.41), and hence (7.38), has a unique solution, but we do not pursue this any
further.

For linear systems, we have

$$f(x, y) = C(x) y + d(x), \quad a \le x \le b, \tag{7.42}$$

in which case the initial value problem (7.40) is known to have the solution

$$u(x) = Y(x) s + v(x), \tag{7.43}$$

where $Y(x) \in \mathbb{R}^{d \times d}$ is a fundamental solution of the homogeneous system
$dY/dx = C(x) Y$ with initial value $Y(a) = I$, and $v(x)$ a particular solution
of the inhomogeneous system $dv/dx = C(x) v + d(x)$ satisfying $v(a) = 0$. The
boundary value problem (7.38) is then equivalent to the system of linear algebraic
equations

$$[A + B Y(b)]\, s = \gamma - B v(b) \tag{7.44}$$

and has a unique solution if and only if the matrix of this system is nonsingular.

We remark that if some components of y(a) are prescribed as part of the 
boundary conditions in (7.38), then of course they are incorporated in the vector 
s, and one obtains a smaller system of nonlinear (resp., linear) equations in the 
remaining (unknown) components of s. 


7.2 Initial Value Techniques 


The techniques used in Sects. 7.1.2 and 7.1.3 are also of computational interest
in that they lend themselves to the application of numerical methods for solving
nonlinear equations or systems of equations. We show, for example, how Newton’s
method can be used in this context, first for a scalar second-order boundary value
problem, and then for a general problem involving a first-order system of differential 
equations. 


7.2.1 Shooting Method for a Scalar Boundary Value Problem 


We have seen in Sect. 7.1.2 that the boundary value problem (7.18) and (7.19) leads 
to the nonlinear equation (7.25) via the initial value problem (7.21) and (7.24). 
Solving the initial value problem is referred to, in this context, as “shooting.” One 
aims by means of trial initial conditions to satisfy the second boundary condition, 
which is the “target.” A mechanism of readjusting the aim, based on the amount by 
which the target has been missed, is provided by Newton’s method. Specifically, one 
starts with an initial approximation $s^{(0)}$ for $s$ in (7.25), and then iterates according to

$$s^{(\nu+1)} = s^{(\nu)} - \frac{\phi(s^{(\nu)})}{\phi'(s^{(\nu)})}, \quad \nu = 0, 1, 2, \ldots, \tag{7.45}$$

until, it is hoped, $s^{(\nu)} \to s_\infty$ as $\nu \to \infty$. If that occurs, then $y(x) = u(x; s_\infty)$ will
be a solution of the boundary value problem. If there is more than one solution, the
process (7.45) needs to be repeated, perhaps several times, with different starting
values $s^{(0)}$.

For any given $s$, the values of $\phi(s)$ and $\phi'(s)$ needed in (7.45) are computed
simultaneously by “shooting,” that is, by solving the initial value problem (7.21)
and (7.24) together with the one in (7.28) obtained by differentiation with respect to
$s$. If both are written as first-order systems, by letting

$$y_1(x) = u(x; s), \quad y_2(x) = u'(x; s), \quad y_3(x) = v(x), \quad y_4(x) = v'(x),$$


one solves on $[a, b]$ the initial value problem

$$\begin{aligned}
\frac{dy_1}{dx} &= y_2, & y_1(a) &= a_1 s - c_1 \alpha, \\
\frac{dy_2}{dx} &= f(x, y_1, y_2), & y_2(a) &= a_0 s - c_0 \alpha, \\
\frac{dy_3}{dx} &= y_4, & y_3(a) &= a_1, \\
\frac{dy_4}{dx} &= f_{u_1}(x, y_1, y_2)\, y_3 + f_{u_2}(x, y_1, y_2)\, y_4, & y_4(a) &= a_0,
\end{aligned} \tag{7.46}$$

with $c_0$, $c_1$ as chosen in (7.23), and then computes

$$\phi(s) = b_0 y_1(b) + b_1 y_2(b) - \beta, \qquad \phi'(s) = b_0 y_3(b) + b_1 y_4(b). \tag{7.47}$$

Thus, each Newton step (7.45) requires the solution on $[a, b]$ of an initial value
problem (7.46) with $s = s^{(\nu)}$.
Example. $y'' = -e^{-y}$, $0 \le x \le 1$, $y(0) = y(1) = 0$.

We first show that this problem has a unique solution. To do this, we “embed”
the problem into the following problem:

$$y'' = f(y), \quad 0 \le x \le 1; \qquad y(0) = y(1) = 0, \tag{7.48}$$

where

$$f(y) = \begin{cases} -e^{-y} & \text{if } y \ge 0, \\ e^{y} - 2 & \text{if } y \le 0. \end{cases} \tag{7.49}$$

Then $f_y(y) = e^{-y}$ if $y \ge 0$ and $f_y(y) = e^{y}$ if $y \le 0$, so that $0 < f_y(y) \le 1$
for all real $y$. Thus, Assumption (2) of Theorem 7.1.2 is satisfied, and so are
(trivially) the other two assumptions. It follows that (7.48) has a unique solution,
and since clearly this solution cannot become negative (the second derivative being
necessarily negative), it also solves the problem in our Example.


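Since $a_0 = b_0 = 1$, $a_1 = b_1 = \alpha = \beta = 0$ in this example, the system (7.46)
becomes, for $0 \le x \le 1$,

$$\begin{aligned}
\frac{dy_1}{dx} &= y_2, & y_1(0) &= 0, \\
\frac{dy_2}{dx} &= -e^{-y_1}, & y_2(0) &= s, \\
\frac{dy_3}{dx} &= y_4, & y_3(0) &= 0, \\
\frac{dy_4}{dx} &= e^{-y_1} y_3, & y_4(0) &= 1,
\end{aligned}$$

and (7.47) simplifies to

$$\phi(s) = y_1(1), \qquad \phi'(s) = y_3(1).$$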

Newton’s method (7.45), of course, has to be started with a positive initial
approximation $s^{(0)}$. If we use $s^{(0)} = 1$ and an error tolerance of $\tfrac12 \times 10^{-12}$, it produces,
with the help of Matlab routine ode45, the results shown in Table 7.2 (cf. also
Ex. 3).
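The computation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a fixed-step classical Runge–Kutta method (the hypothetical helper `rk4`) stands in for the Matlab routine ode45, and the names `rhs`, `shoot`, and `newton_shooting` are not from the text. The sketch integrates the four-component system for $y'' = -e^{-y}$ and applies the Newton iteration (7.45) starting from $s^{(0)} = 1$.

```python
import math

def rk4(f, x0, y0, x1, steps):
    """Integrate y' = f(x, y) from x0 to x1 with the classical RK4 method."""
    h = (x1 - x0) / steps
    x, y = x0, list(y0)
    for _ in range(steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, [a + h / 2 * b for a, b in zip(y, k1)])
        k3 = f(x + h / 2, [a + h / 2 * b for a, b in zip(y, k2)])
        k4 = f(x + h, [a + h * b for a, b in zip(y, k3)])
        y = [a + h / 6 * (p + 2 * q + 2 * r + w)
             for a, p, q, r, w in zip(y, k1, k2, k3, k4)]
        x += h
    return y

def rhs(x, y):
    # y = [y1, y2, y3, y4] = [u, u', v, v'] for the equation y'' = -exp(-y)
    return [y[1], -math.exp(-y[0]), y[3], math.exp(-y[0]) * y[2]]

def shoot(s, steps=400):
    """Return (phi(s), phi'(s)) = (y1(1), y3(1)) for the initial slope s."""
    y = rk4(rhs, 0.0, [0.0, s, 0.0, 1.0], 1.0, steps)
    return y[0], y[2]

def newton_shooting(s, tol=0.5e-12, max_iter=20):
    """Newton iteration (7.45) on phi(s) = 0."""
    for _ in range(max_iter):
        phi, dphi = shoot(s)
        s_new = s - phi / dphi
        if abs(s_new - s) <= tol:
            return s_new
        s = s_new
    raise RuntimeError("Newton iteration did not converge")

s_inf = newton_shooting(1.0)
print(s_inf)
```

With a sufficiently fine step the limit should agree with the last entry of Table 7.2 to many digits; the number of steps (400) is an arbitrary choice for this sketch.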


Table 7.2 Numerical results for the first Example

  $\nu$    $s^{(\nu)}$
  0    1.0
  1    0.45
  2    0.463630
  3    0.463632593668

Example. $y'' = \lambda \sinh(\lambda y)$, $0 \le x \le 1$, $y(0) = y(1) = 0$.

If $y'(0) = s$, $s \ne 0$, the solution $y$ of the differential equation has the sign of
$s$ in a right neighborhood of $x = 0$, and so does $y''$. Thus, $|y|$ is monotonically
increasing and cannot attain the value zero at $x = 1$. It follows that $y(x) \equiv 0$ is
the unique solution of the boundary value problem. Moreover, it can be shown (see
Ex. 8(c)) that for $y'(0) = s$ the modulus $|y|$ of the solution tends to infinity at a
finite value $x_\infty$ of $x$, where

$$x_\infty \sim \frac{1}{\lambda} \ln \frac{8}{|s|}, \quad s \to 0. \tag{7.50}$$

Thus, if we apply shooting with $y'(0) = s$, we can reach the end point $x = 1$
without the solution blowing up on us only if $x_\infty > 1$, that is, approximately, if

$$|s| < 8 e^{-\lambda}. \tag{7.51}$$

This places a severe restriction on the permissible initial slope; for example, $|s| <
3.63\ldots \times 10^{-4}$ if $\lambda = 10$ and $|s| < 1.64\ldots \times 10^{-8}$ if $\lambda = 20$. It is indeed one of the
limitations of “ordinary” shooting that rather accurate initial data must be known in 
order to succeed. 
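The two numerical bounds quoted above follow directly from (7.51). As a small worked check, the following Python lines evaluate $8e^{-\lambda}$ and the asymptotic estimate (7.50); the function names are hypothetical and chosen only for this illustration.

```python
import math

def max_slope(lam):
    """Approximate bound (7.51): |s| < 8 exp(-lam) is needed for x_inf > 1."""
    return 8.0 * math.exp(-lam)

def blowup_abscissa(lam, s):
    """Asymptotic estimate (7.50) of the blow-up point, valid for small |s|."""
    return math.log(8.0 / abs(s)) / lam

for lam in (10.0, 20.0):
    print(lam, max_slope(lam))
```

By construction, `blowup_abscissa(lam, max_slope(lam))` returns 1 up to rounding, which is the borderline case between (7.50) and (7.51).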


7.2.2 Linear and Nonlinear Systems 


For linear systems, the shooting method amounts to solving the linear system of 
algebraic equations (7.44), which requires the numerical solution of d + 1 initial 
value problems to obtain Y(b) and v(b), and possibly one more final integration, 
with the starting vector found, to determine the solution $y(x)$ at the desired values of
$x$. It is strictly a superposition method and no iteration is required, but the procedure
often suffers from ill-conditioning of the matrix involved. 
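To make the superposition idea concrete, here is a minimal Python sketch for $d = 2$. The test problem is an assumption made for illustration only: $y'' = y - x$, $y(0) = y(1) = 0$, written as a first-order system, whose exact solution $y = x - \sinh x / \sinh 1$ gives $y'(0) = 1 - 1/\sinh 1$. The sketch solves $d + 1 = 3$ initial value problems with a hypothetical fixed-step `rk4` helper and then the $2 \times 2$ system (7.44) by Cramer's rule.

```python
import math

def rk4(f, x0, y0, x1, steps=200):
    """Classical RK4 for y' = f(x, y), with y stored as a list of floats."""
    h = (x1 - x0) / steps
    x, y = x0, list(y0)
    for _ in range(steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, [a + h / 2 * b for a, b in zip(y, k1)])
        k3 = f(x + h / 2, [a + h / 2 * b for a, b in zip(y, k2)])
        k4 = f(x + h, [a + h * b for a, b in zip(y, k3)])
        y = [a + h / 6 * (p + 2 * q + 2 * r + w)
             for a, p, q, r, w in zip(y, k1, k2, k3, k4)]
        x += h
    return y

def matvec(M, y):
    return [sum(m * yj for m, yj in zip(row, y)) for row in M]

# Assumed test problem: y1' = y2, y2' = y1 - x, i.e. f = C(x) y + d(x) as in (7.42)
def C(x):
    return [[0.0, 1.0], [1.0, 0.0]]

def dvec(x):
    return [0.0, -x]

def hom(x, y):      # homogeneous system, used for the columns of Y
    return matvec(C(x), y)

def inhom(x, y):    # inhomogeneous system, used for v
    return [c + di for c, di in zip(matvec(C(x), y), dvec(x))]

a, b = 0.0, 1.0
A = [[1.0, 0.0], [0.0, 0.0]]   # boundary conditions y1(0) = 0, y1(1) = 0
B = [[0.0, 0.0], [1.0, 0.0]]
gamma = [0.0, 0.0]

cols = [rk4(hom, a, e, b) for e in ([1.0, 0.0], [0.0, 1.0])]   # Y(a) = I
Yb = [[cols[j][i] for j in range(2)] for i in range(2)]        # Y(b)
vb = rk4(inhom, a, [0.0, 0.0], b)                              # v(a) = 0

# Linear system (7.44): [A + B Y(b)] s = gamma - B v(b)
M = [[A[i][j] + sum(B[i][k] * Yb[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
rhs = [g - w for g, w in zip(gamma, matvec(B, vb))]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
s = [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
     (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det]
print(s)
```

For larger $d$ one would replace Cramer's rule by a general linear solver; the ill-conditioning mentioned above shows up when the entries of $Y(b)$ grow rapidly with $b - a$.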

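For the general nonlinear boundary value problem (7.38), there is no difficulty,
formally, in defining a shooting method. One simply has to solve the system of
nonlinear equations (7.41), for example by Newton’s method,

$$s^{(\nu+1)} = s^{(\nu)} + \Delta_\nu, \qquad \frac{\partial \phi}{\partial s}(s^{(\nu)})\, \Delta_\nu = -\phi(s^{(\nu)}), \quad \nu = 0, 1, 2, \ldots, \tag{7.52}$$

where $\partial \phi / \partial s$ is the Jacobian of $\phi$,

$$\frac{\partial \phi}{\partial s}(s) = A + B\, \frac{\partial u(b; s)}{\partial s}.$$

With $u(x; s)$ denoting, as before, the solution of (7.40), we let $V(x)$ be its Jacobian
(with respect to $s$),

$$V(x) = \frac{\partial u}{\partial s}(x; s), \quad a \le x \le b.$$

Then, in order to simultaneously compute $\phi(s)$ and $(\partial \phi / \partial s)(s)$, we can integrate
the initial value problem (7.40) adjoined with that for its Jacobian,

$$\frac{du}{dx} = f(x, u), \quad \frac{dV}{dx} = f_y(x, u)\, V, \quad a \le x \le b; \qquad u(a) = s, \quad V(a) = I, \tag{7.53}$$

and get

$$\phi(s) = A s + B u(b; s) - \gamma, \qquad \frac{\partial \phi}{\partial s}(s) = A + B V(b). \tag{7.54}$$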


Although the procedure is formally straightforward, it is difficult to implement. 
For one thing, we may not be able to integrate (7.53) on the whole interval [a, b]; 
some component may blow up before we reach the endpoint b (cf. the second 
Example of Sect. 7.2.1). Another problem has to do with the convergence of 
Newton’s method (7.52). Typically, this requires $s$ in (7.53) to be very close to
$s_\infty$, the true initial vector for one of the possible solutions of the boundary
value problem. A good illustration of these difficulties is provided by the following
example. 


Example.

$$\frac{dy_1}{dx} = \frac{y_1^2}{y_2}, \qquad \frac{dy_2}{dx} = \frac{y_2^2}{y_1}, \quad 0 \le x \le 1, \tag{7.55}$$

$$y_1(0) = 1, \qquad y_1(1) = e \ (= 2.718\ldots).$$

This is really a linear system in disguise, namely, the one for the reciprocal functions
$y_1^{-1}$ and $y_2^{-1}$. Hence it is easily seen that an exact solution to (7.55) is (cf. Ex. 4)

$$y_1(x) = y_2(x) = e^x, \quad 0 \le x \le 1. \tag{7.56}$$

We write the system in this complicated nonlinear form to bring out the difficulties
inherent in the shooting method.
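Since $y_1$ is known at $x = 0$, we have only one unknown, $s = y_2(0)$. We thus
define $u_1(x; s)$, $u_2(x; s)$ to be the solution of the initial value problem

$$\frac{d}{dx} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} u_1^2 / u_2 \\ u_2^2 / u_1 \end{bmatrix}, \quad 0 \le x \le 1; \qquad \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}(0) = \begin{bmatrix} 1 \\ s \end{bmatrix}. \tag{7.57}$$

The equation to be solved then is

$$\phi(s) = 0, \qquad \phi(s) = u_1(1; s) - e. \tag{7.58}$$

We denote $\frac{\partial u_1(x; s)}{\partial s} = v_1(x)$, $\frac{\partial u_2(x; s)}{\partial s} = v_2(x)$, differentiate (7.57) with respect to $s$,
and note that the Jacobian of (7.57) is

$$f_y(u_1, u_2) = \begin{bmatrix} \dfrac{2 u_1}{u_2} & -\dfrac{u_1^2}{u_2^2} \\[2mm] -\dfrac{u_2^2}{u_1^2} & \dfrac{2 u_2}{u_1} \end{bmatrix}.$$

If we append the differentiated system to the original one, we get

$$\begin{aligned}
\frac{du_1}{dx} &= \frac{u_1^2}{u_2}, & u_1(0) &= 1, \\
\frac{du_2}{dx} &= \frac{u_2^2}{u_1}, & u_2(0) &= s, \\
\frac{dv_1}{dx} &= \frac{2 u_1}{u_2}\, v_1 - \frac{u_1^2}{u_2^2}\, v_2, & v_1(0) &= 0, \\
\frac{dv_2}{dx} &= -\frac{u_2^2}{u_1^2}\, v_1 + \frac{2 u_2}{u_1}\, v_2, & v_2(0) &= 1.
\end{aligned} \tag{7.59}$$

Assuming this can be solved on $[0, 1]$, we will have

$$\phi(s) = u_1(1; s) - e, \qquad \phi'(s) = v_1(1), \tag{7.60}$$

and can thus apply Newton’s method,

$$s^{(\nu+1)} = s^{(\nu)} - \frac{\phi(s^{(\nu)})}{\phi'(s^{(\nu)})}, \tag{7.61}$$

taking $s = s^{(\nu)}$ in (7.59).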
Because of the elementary nature of the system (7.57), we can solve the initial
value problem (7.57) in closed form (cf. Ex. 4),

$$u_1(x; s) = \frac{s}{s \cosh x - \sinh x}, \qquad u_2(x; s) = \frac{s}{\cosh x - s \sinh x}. \tag{7.62}$$

It is convenient to look at the solutions of (7.57) as curves in the phase plane $(u_1, u_2)$.
The desired solution, by (7.56), is then given by $u_1 = u_2$.

Clearly, $s$ has to be positive in (7.57), since otherwise $u_1$ would initially decrease
and, to meet the condition at $x = 1$, would have to turn around and have a vanishing
derivative at some point in $(0, 1)$. That would cause a problem in (7.57), since either
$u_1 = 0$ or $u_2 = \infty$ at that point.

For $u_1$ to remain bounded on $[0, 1]$, it then follows from the first relation in (7.62)
that we must have $s > \tanh x$ for $0 \le x \le 1$; that is,

$$s > \tanh 1 = 0.76159\ldots .$$

At $s = \tanh 1$, we have

$$\lim_{x \to 1} u_1(x; \tanh 1) = \infty, \qquad \lim_{x \to 1} u_2(x; \tanh 1) = \sinh 1 = 1.1752\ldots .$$

This solution, in the phase plane, has a horizontal asymptote.
Similarly, for $u_2$ to remain bounded on $[0, 1]$, we must have

$$s < \coth 1 = 1.3130\ldots .$$

When $s = \coth 1$, then

$$\lim_{x \to 1} u_1(x; \coth 1) = \cosh 1 = 1.5430\ldots, \qquad \lim_{x \to 1} u_2(x; \coth 1) = \infty,$$

giving a solution with a vertical asymptote. The locus of points $[u_1(1; s),
u_2(1; s)]$ as $\tanh 1 < s < \coth 1$ is easily found to be the hyperbola

$$u_2 = \frac{u_1 \sinh 1}{u_1 - \cosh 1}.$$

From this, we get a complete picture in the phase plane of all solutions of (7.57); see
Fig. 7.1. Thus, only in a relatively small $s$-interval, $0.76159\ldots < s < 1.3130\ldots$,
it is possible to shoot from the initial point to the endpoint without one component
of the solution blowing up in between.

What about the convergence of Newton’s method (7.61)? The equation to be
solved is (7.58), that is,

$$\phi(s) = 0, \qquad \phi(s) = \frac{s}{s \cosh 1 - \sinh 1} - e.$$

Fig. 7.1 The solutions of (7.57) in the phase plane. The solid lines are solutions for fixed $s$ in
$\tanh 1 < s < \coth 1$ and $x$ varying between 0 and 1. The dashed line is the locus of points
$[u_1(1; s), u_2(1; s)]$, $\tanh 1 < s < \coth 1$

Fig. 7.2 The graph of $\phi(s)$ and convergence of Newton’s method

From the graph of $\phi(s)$ (see Fig. 7.2), in particular, the convexity of $\phi$, it follows
that Newton’s method converges precisely if

$$\tanh 1 < s < s^0, \tag{7.63}$$

where $s^0$ is such that

$$s^0 - \frac{\phi(s^0)}{\phi'(s^0)} = \tanh 1. \tag{7.64}$$

It can be shown (see Ex. 5(a)) that $s^0 < \coth 1$, so that the convergence of Newton’s
method imposes an additional restriction on the choice of $s$.
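Since $\phi$ is available in closed form here, the convergence interval (7.63) can be explored numerically. The following Python sketch is an illustration only; it uses the derivative $\phi'(s) = -\sinh 1 / (s \cosh 1 - \sinh 1)^2$ obtained by differentiating the closed form, and the helper names (`newton_map`, `newton`, `s_zero`) are hypothetical. It locates $s^0$ from (7.64) by bisection, which is justified because the Newton map is decreasing on $1 < s < \coth 1$ (its derivative is $\phi \phi'' / \phi'^2 < 0$ there).

```python
import math

C1, S1, E = math.cosh(1.0), math.sinh(1.0), math.e
TANH1, COTH1 = math.tanh(1.0), 1.0 / math.tanh(1.0)

def phi(s):
    return s / (s * C1 - S1) - E

def dphi(s):
    return -S1 / (s * C1 - S1) ** 2

def newton_map(s):
    """One Newton step (7.61) applied to the closed-form phi."""
    return s - phi(s) / dphi(s)

def newton(s, tol=1e-14, max_iter=100):
    """Return the Newton limit, or None if an iterate leaves s > tanh 1."""
    for _ in range(max_iter):
        s_new = newton_map(s)
        if s_new <= TANH1:
            return None
        if abs(s_new - s) <= tol:
            return s_new
        s = s_new
    return None

def s_zero():
    """Solve (7.64), newton_map(s0) = tanh 1, by bisection on [1, coth 1]."""
    lo, hi = 1.0, COTH1
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if newton_map(mid) > TANH1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s0 = s_zero()
print(s0, COTH1)
print(newton(0.8), newton(1.2), newton(1.3))
```

Starting values between $\tanh 1$ and $s^0$ lead to the root $s = 1$, whereas a start such as $s = 1.3$, although still below $\coth 1$, sends the first iterate below $\tanh 1$.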


7.2.3 Parallel Shooting 


To circumvent the difficulties inherent in ordinary shooting, as indicated in 
Examples of Sects. 7.2.1 and 7.2.2, one may divide the interval [a,b] into N 
subintervals, 

$$a = x_0 < x_1 < x_2 < \cdots < x_{N-1} < x_N = b, \tag{7.65}$$


and apply shooting concurrently on each subinterval. The hope is that if the intervals 
are sufficiently small, not only does the appropriate boundary value problem have a 
unique solution, but also the solution is not given a chance to grow excessively. We 
say “appropriate” boundary value problem since in addition to the two boundary 
conditions, there are now also continuity conditions at the interior subdivision 
points. This, of course, enlarges considerably the problem size. To enhance the 
prospects of success, it is advisable to generate the subdivision (7.65) dynamically, 
as described further on, rather than to choose it artificially without regard to the 
particular features of the problem at hand. 

To describe the procedure in more detail, consider the boundary value problem 
for a general nonlinear system, 


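$$\frac{dy}{dx} = f(x, y), \quad a \le x \le b; \qquad A y(a) + B y(b) = \gamma. \tag{7.66}$$

Let $h_n = x_n - x_{n-1}$, $n = 1, 2, \ldots, N$, and define

$$y_n(t) = y(x_{n-1} + t h_n), \quad 0 \le t \le 1. \tag{7.67}$$

Clearly,

$$\frac{dy_n}{dt} = h_n y'(x_{n-1} + t h_n) = h_n f(x_{n-1} + t h_n, y_n(t)), \quad 0 \le t \le 1.$$

Thus, by letting

$$f_n(t, z) = h_n f(x_{n-1} + t h_n, z), \tag{7.68}$$

we have for the (vector-valued) functions $y_n$ the following system of, as yet
uncoupled, differential equations

$$\frac{dy_n}{dt} = f_n(t, y_n), \quad n = 1, 2, \ldots, N; \quad 0 \le t \le 1. \tag{7.69}$$

The coupling comes about through the boundary and interface (continuity)
conditions

$$A y_1(0) + B y_N(1) = \gamma, \qquad y_{n+1}(0) - y_n(1) = 0, \quad n = 1, 2, \ldots, N - 1. \tag{7.70}$$

Introducing “supervectors” and “supermatrices,”

$$Y(t) = \begin{bmatrix} y_1(t) \\ \vdots \\ y_N(t) \end{bmatrix}, \quad F(t, Y) = \begin{bmatrix} f_1(t, y_1) \\ \vdots \\ f_N(t, y_N) \end{bmatrix}, \quad \Gamma = \begin{bmatrix} \gamma \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

$$P = \begin{bmatrix} A & 0 & \cdots & 0 \\ 0 & I & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & I \end{bmatrix}, \quad Q = \begin{bmatrix} 0 & 0 & \cdots & 0 & B \\ -I & 0 & \cdots & 0 & 0 \\ 0 & -I & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & -I & 0 \end{bmatrix}, \tag{7.71}$$

we can write (7.69) and (7.70) compactly as

$$\frac{dY}{dt} = F(t, Y), \quad 0 \le t \le 1; \qquad P Y(0) + Q Y(1) = \Gamma; \tag{7.72}$$

this has the same form as (7.66) but is much bigger in size. Parallel shooting consists
of applying ordinary shooting to the big system (7.72). Thus, we solve on $0 \le t \le 1$

$$\frac{dU}{dt} = F(t, U), \qquad U(0) = S, \tag{7.73}$$

to obtain $U(t) = U(t; S)$ and try to determine the vector $S \in \mathbb{R}^{Nd}$ such that

$$\Phi(S) = P S + Q U(1; S) - \Gamma = 0. \tag{7.74}$$

If we use Newton’s method, this is done by the iteration

$$S^{(\nu+1)} = S^{(\nu)} + \Delta_\nu, \qquad [P + Q V(1; S^{(\nu)})]\, \Delta_\nu = -\Phi(S^{(\nu)}), \quad \nu = 0, 1, 2, \ldots, \tag{7.75}$$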
where

$$V(t; S) = \frac{\partial U}{\partial S}(t; S), \quad 0 \le t \le 1. \tag{7.76}$$

Fig. 7.3 Parallel shooting (one-sided)

If we partition $U^T = [u_1^T, \ldots, u_N^T]$, $S^T = [s_1^T, \ldots, s_N^T]$ in accordance with (7.69),
then, since (7.69) by itself is uncoupled, we find that $u_n = u_n(t; s_n)$ depends only
on $s_n$. As a consequence, the “big” Jacobian $V$ in (7.76) is block diagonal,

$$V = \begin{bmatrix} v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_N \end{bmatrix}, \qquad v_n(t; s_n) = \frac{\partial u_n}{\partial s_n}(t; s_n), \quad n = 1, 2, \ldots, N,$$

and so is the Jacobian $F_Y(t, U)$ of $F$ in (7.73). This means that $U$ in (7.73) and $V$
in (7.76) can be computed by solving uncoupled systems of initial value problems
on $0 \le t \le 1$,

$$\frac{du_n}{dt} = f_n(t, u_n), \quad u_n(0) = s_n; \qquad \frac{dv_n}{dt} = \frac{\partial f_n}{\partial u_n}(t, u_n)\, v_n, \quad v_n(0) = I; \qquad n = 1, 2, \ldots, N. \tag{7.77}$$

This can be done in parallel — hence the name “parallel shooting.” 
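A compact Python sketch of one-sided parallel shooting may help fix the structure of (7.74)–(7.77). Everything specific in it is an assumption made for illustration: the test problem $y'' = \lambda^2 y$, $y(0) = 1$, $y(1) = 0$ with $\lambda = 10$ (exact solution $\sinh(\lambda(1 - x)) / \sinh \lambda$), the uniform subdivision with $N = 4$, the fixed-step `rk4` integrator, and all function names. The subinterval integrations in `newton_step` are independent and are merely executed one after another here.

```python
import math

LAM = 10.0
A = [[1.0, 0.0], [0.0, 0.0]]          # y1(0) = 1
B = [[0.0, 0.0], [1.0, 0.0]]          # y1(1) = 0
GAMMA = [1.0, 0.0]
XS = [0.0, 0.25, 0.5, 0.75, 1.0]      # subdivision (7.65), N = 4

def f(x, y):
    return [y[1], LAM ** 2 * y[0]]

def f_y(x, y):
    return [[0.0, 1.0], [LAM ** 2, 0.0]]

def rk4(g, z, steps=200):
    """Classical RK4 for z' = g(t, z) on 0 <= t <= 1."""
    h, t = 1.0 / steps, 0.0
    for _ in range(steps):
        k1 = g(t, z)
        k2 = g(t + h / 2, [a + h / 2 * b for a, b in zip(z, k1)])
        k3 = g(t + h / 2, [a + h / 2 * b for a, b in zip(z, k2)])
        k4 = g(t + h, [a + h * b for a, b in zip(z, k3)])
        z = [a + h / 6 * (p + 2 * q + 2 * r + w)
             for a, p, q, r, w in zip(z, k1, k2, k3, k4)]
        t += h
    return z

def shoot(n, s):
    """Solve (7.77) on subinterval n (0-based); return u_n(1) and v_n(1)."""
    x0, hn = XS[n], XS[n + 1] - XS[n]
    def g(t, z):                       # z = [u1, u2, V00, V01, V10, V11]
        x, u = x0 + t * hn, z[:2]
        J = f_y(x, u)
        dv = [J[i][0] * z[2 + j] + J[i][1] * z[4 + j]
              for i in range(2) for j in range(2)]
        return [hn * c for c in f(x, u) + dv]
    z = rk4(g, list(s) + [1.0, 0.0, 0.0, 1.0])
    return z[:2], [z[2:4], z[4:6]]

def solve(M, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    T = [row[:] + [r] for row, r in zip(M, rhs)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(T[i][k]))
        T[k], T[piv] = T[piv], T[k]
        for i in range(k + 1, n):
            m = T[i][k] / T[k][k]
            for j in range(k, n + 1):
                T[i][j] -= m * T[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (T[i][n] - sum(T[i][j] * x[j] for j in range(i + 1, n))) / T[i][i]
    return x

def newton_step(S):
    """One step of (7.75); also returns max |Phi(S)| at the incoming S."""
    N, d = len(S), 2
    ends = [shoot(n, S[n]) for n in range(N)]   # independent subproblems
    Phi = [0.0] * (N * d)
    Jac = [[0.0] * (N * d) for _ in range(N * d)]
    uN, vN = ends[-1]
    for i in range(d):                 # boundary condition block row
        Phi[i] = sum(A[i][j] * S[0][j] + B[i][j] * uN[j] for j in range(d)) - GAMMA[i]
        for j in range(d):
            Jac[i][j] += A[i][j]
            Jac[i][(N - 1) * d + j] += sum(B[i][k] * vN[k][j] for k in range(d))
    for n in range(1, N):              # continuity block rows
        u, v = ends[n - 1]
        for i in range(d):
            Phi[n * d + i] = S[n][i] - u[i]
            Jac[n * d + i][n * d + i] += 1.0
            for j in range(d):
                Jac[n * d + i][(n - 1) * d + j] -= v[i][j]
    delta = solve(Jac, [-p for p in Phi])
    S_new = [[S[n][i] + delta[n * d + i] for i in range(d)] for n in range(N)]
    return S_new, max(abs(p) for p in Phi)

S = [[0.0, 0.0] for _ in range(len(XS) - 1)]
for _ in range(3):
    S, res = newton_step(S)
print(S[0], res)
```

Because the test problem is linear, a single Newton step already solves the discrete equations; the further steps only confirm that the residual has dropped to rounding level. For a nonlinear $f$ the same code applies once `f` and `f_y` are replaced, but the starting vector $S$ then matters.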

The procedure may be summarized schematically as in Fig. 7.3. Alternatively, if 
N is even, one may shoot both forward and backward, as indicated in Fig. 7.4 for 
N = 4. This reduces the size of the big system by one-half. 

Even though multiple shooting can be quite effective, there are many practical 
problems associated with it. Perhaps the major problems are related to obtaining 
good initial approximations (recall, we have to choose a reasonable vector S in 
(7.73)) and to constructing a natural subdivision (7.65). With regard to the latter, 
suppose we have some rough approximation $\eta(x) \approx y(x)$ on $a \le x \le b$.
Fig. 7.4 Parallel shooting (two-sided), $N = 4$

Fig. 7.5 Construction of subdivision

Then, taking $x_0 = a$, we may construct $x_1, x_2, \ldots$ recursively as follows. For
$i = 0, 1, 2, \ldots$ solve the initial value problem

$$\frac{dz_i}{dx} = f(x, z_i), \qquad z_i(x_i) = \eta(x_i), \quad x \ge x_i, \tag{7.78}$$

and take for $x_{i+1}$ the smallest $x > x_i$ such that (say) $\|z_i(x)\| \ge 2 \|\eta(x)\|$. In other
words, we do not allow the solution of (7.78) to increase more than twice in size;
see Fig. 7.5. Thus, (7.78) are strictly auxiliary integrations whose sole purpose is to
produce an appropriate subdivision of $[a, b]$.
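The recursive construction can be sketched as follows for a scalar equation. This is a simplified illustration under explicit assumptions: absolute values replace the norms, a fixed-step Runge–Kutta integration is used so that each $x_{i+1}$ is located only to within one step $h$, and the test data ($z' = \lambda z$ with $\lambda = 5$ and the rough approximation $\eta \equiv 1$) are hypothetical. For these data the auxiliary solution doubles after a distance $\ln 2 / \lambda$, so the breakpoints should be spaced by about that amount.

```python
import math

def subdivision(f, eta, a, b, growth=2.0, h=1e-3):
    """Breakpoints in the spirit of (7.78) for a scalar problem z' = f(x, z).

    The auxiliary solution is restarted from eta whenever
    |z(x)| >= growth * |eta(x)|; x is advanced by fixed RK4 steps of size h.
    """
    xs, x, z = [a], a, eta(a)
    while x < b - 1e-12:
        step = min(h, b - x)
        k1 = f(x, z)
        k2 = f(x + step / 2, z + step / 2 * k1)
        k3 = f(x + step / 2, z + step / 2 * k2)
        k4 = f(x + step, z + step * k3)
        z += step / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        x += step
        if abs(z) >= growth * abs(eta(x)) and x < b - 1e-12:
            xs.append(x)       # new subdivision point x_{i+1}
            z = eta(x)         # restart the auxiliary integration from eta
    xs.append(b)
    return xs

lam = 5.0
xs = subdivision(lambda x, z: lam * z, lambda x: 1.0, 0.0, 1.0)
print(xs)
```

A production code would locate the crossing more precisely (for example with an event-detecting integrator), but the logic of restarting from $\eta$ at each new breakpoint is the same.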

There are circumstances in which reasonable initial approximations may be
readily available, for example, if one solves the given boundary value problem (7.66)
by a homotopy method. Basically, this means that the problem is embedded in a
family of problems,

$$P_\omega: \quad \frac{dy}{dx} = f_\omega(x, y), \quad a \le x \le b; \qquad A y(a) + B y(b) = \gamma,$$

where $\omega$ is a (usually physically meaningful) parameter, say, in the interval $0 \le
\omega \le 1$. This is done in such a way that for $\omega = 0$ the solution of $P_\omega$ is easy, and
for $\omega = 1$, we have $f_\omega(x, y) = f(x, y)$. One then solves a sequence of boundary
value problems $P_{\omega_i}$ corresponding to parameter values $0 = \omega_0 < \omega_1 < \omega_2 < \cdots <
\omega_m = 1$ which are chosen sufficiently close to one another so that the solution of
$P_{\omega_i}$ differs relatively little from the solution of $P_{\omega_{i-1}}$. When $\omega = \omega_1$, one takes for
$\eta(x)$ the easy solution of $P_{\omega_0}$, constructs the appropriate subdivision as before, and
then solves $P_{\omega_1}$ by parallel shooting based on the subdivision generated. One next
solves $P_{\omega_2}$, using for $\eta(x)$ the solution of $P_{\omega_1}$, and proceeds as with $P_{\omega_1}$. Continuing
in this way, we will eventually have solved $P_{\omega_m} = P_1$, the given problem (7.66).
Although the procedure is rather labor-intensive, it has the potential of providing
accurate solutions to very difficult problems.


7.3 Finite Difference Methods 


A more static approach toward solving boundary value problems is via direct 
discretization. One puts a grid on the interval of interest, replaces derivatives by 
finite difference expressions, and requires the discrete version of the problem to 
hold at all interior grid points. This gives rise to a system of linear or nonlinear 
equations for the unknown values of the solution at the grid points. 

We consider and analyze only the simplest finite difference schemes. We assume 
throughout a uniform grid, say, 


$$a = x_0 < x_1 < x_2 < \cdots < x_N < x_{N+1} = b, \qquad x_n = a + nh, \quad h = \frac{b - a}{N + 1}, \tag{7.79}$$


and we continue to use the terminology of grid functions introduced in Chap. 5, 
Sect. 5.7. 


7.3.1 Linear Second-Order Equations 


We consider the Sturm–Liouville problem (cf. (7.32) and (7.33))

$$\mathcal{L} y = r(x), \quad a \le x \le b, \tag{7.80}$$

where

$$\mathcal{L} y := -y'' + p(x) y' + q(x) y, \tag{7.81}$$

with the simplest boundary conditions

$$y(a) = \alpha, \qquad y(b) = \beta. \tag{7.82}$$

If $p$, $q$, $r$ are continuous and $q$ positive on $[a, b]$, then (7.80) and (7.82) by
Corollary 7.1.1 have a unique solution. Under these assumptions, there are positive
constants $\bar{p}$, $\underline{q}$, and $\bar{q}$ such that

$$|p(x)| \le \bar{p}, \qquad 0 < \underline{q} \le q(x) \le \bar{q} \quad \text{for } a \le x \le b. \tag{7.83}$$

A simple finite difference operator, acting on a grid function $u \in \Gamma_h[a, b]$, which
approximates the operator $\mathcal{L}$ in (7.81) is

$$(\mathcal{L}_h u)_n = -\frac{u_{n+1} - 2 u_n + u_{n-1}}{h^2} + p(x_n)\, \frac{u_{n+1} - u_{n-1}}{2h} + q(x_n) u_n, \quad n = 1, 2, \ldots, N. \tag{7.84}$$

For any smooth function $v$ on $[a, b]$, we define the grid function of the truncation
error $T_h v$ by

$$(T_h v)_n = (\mathcal{L}_h v)_n - (\mathcal{L} v)(x_n), \quad n = 1, 2, \ldots, N. \tag{7.85}$$

If $v = y$ is the exact solution of (7.80) and (7.82), this reduces to an earlier definition
in Chap. 6, Sect. 6.1.2. By Taylor’s formula one easily finds that for $v \in C^4[a, b]$,

$$(T_h v)_n = -\frac{h^2}{12} \left[ v^{(4)}(\xi_1) - 2 p(x_n) v'''(\xi_2) \right], \quad \xi_1, \xi_2 \in [x_n - h, x_n + h], \tag{7.86}$$

and more precisely, if $v \in C^6[a, b]$, since $\mathcal{L}_h$ is an even function of $h$,

$$(T_h v)_n = -\frac{h^2}{12} \left[ v^{(4)}(x_n) - 2 p(x_n) v'''(x_n) \right] + O(h^4), \quad h \to 0. \tag{7.87}$$
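The leading term in (7.87) is easy to check numerically. The short Python sketch below does so for an assumed test case, $v(x) = \sin x$ with constant $p = 1$; the terms containing $q$ cancel in (7.85), so they are omitted. The function names are hypothetical.

```python
import math

def truncation_error(x, h, p=1.0):
    """(T_h v) at x for v = sin and constant p; the q-terms cancel in (7.85)."""
    v = math.sin
    second = (v(x + h) - 2 * v(x) + v(x - h)) / h ** 2
    first = (v(x + h) - v(x - h)) / (2 * h)
    Lh = -second + p * first              # difference operator without q*v
    L = math.sin(x) + p * math.cos(x)     # -v'' + p v' without q*v
    return Lh - L

def leading_term(x, h, p=1.0):
    """-(h^2/12) [v''''(x) - 2 p v'''(x)] from (7.87) with v = sin."""
    return -h ** 2 / 12 * (math.sin(x) + 2 * p * math.cos(x))

t1 = truncation_error(0.7, 0.1)
t2 = truncation_error(0.7, 0.05)
print(t1 / t2, t2, leading_term(0.7, 0.05))
```

Halving $h$ should reduce the truncation error by a factor close to 4, and the computed error should agree with the leading term up to a relative deviation of order $h^2$.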


In analogy to terminology introduced in Chap. 5, we call the difference operator
$\mathcal{L}_h$ stable if there exists a constant $M$ independent of $h$ such that for $h$ sufficiently
small, one has for any grid function $v = \{v_n\}$

$$\|v\|_\infty \le M \{ \max(|v_0|, |v_{N+1}|) + \|\mathcal{L}_h v\|_\infty \}, \quad v \in \Gamma_h[a, b], \tag{7.88}$$

where $\|v\|_\infty = \max_{0 \le n \le N+1} |v_n|$ and $\|\mathcal{L}_h v\|_\infty = \max_{1 \le n \le N} |(\mathcal{L}_h v)_n|$. The following
theorem gives a sufficient condition for stability.

Theorem 7.3.1. If $h \bar{p} \le 2$, then $\mathcal{L}_h$ is stable. Indeed, (7.88) holds for $M =
\max(1, 1/\underline{q})$. (Here $\bar{p}$, $\underline{q}$ are the constants defined in (7.83).)


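Proof. From (7.84) one computes

$$\tfrac12 h^2 (\mathcal{L}_h v)_n = a_n v_{n-1} + b_n v_n + c_n v_{n+1}, \tag{7.89}$$

where

$$a_n = -\tfrac12 \left[ 1 + \tfrac12 h p(x_n) \right], \qquad b_n = 1 + \tfrac12 h^2 q(x_n), \qquad c_n = -\tfrac12 \left[ 1 - \tfrac12 h p(x_n) \right]. \tag{7.90}$$

Since, by assumption, $\tfrac12 h |p(x_n)| \le \tfrac12 h \bar{p} \le 1$, we have $a_n \le 0$, $c_n \le 0$, and

$$|a_n| + |c_n| = \tfrac12 \left[ 1 + \tfrac12 h p(x_n) \right] + \tfrac12 \left[ 1 - \tfrac12 h p(x_n) \right] = 1. \tag{7.91}$$

Also,

$$b_n \ge 1 + \tfrac12 h^2 \underline{q}. \tag{7.92}$$

Now by (7.89), we have

$$b_n v_n = -a_n v_{n-1} - c_n v_{n+1} + \tfrac12 h^2 (\mathcal{L}_h v)_n,$$

which, upon taking absolute values and using (7.91) and (7.92), yields

$$\left( 1 + \tfrac12 h^2 \underline{q} \right) |v_n| \le \|v\|_\infty + \tfrac12 h^2 \|\mathcal{L}_h v\|_\infty, \quad n = 1, 2, \ldots, N. \tag{7.93}$$

We distinguish two cases.

Case I: $\|v\|_\infty = |v_{n_0}|$, $1 \le n_0 \le N$. Here (7.93) gives

$$\left( 1 + \tfrac12 h^2 \underline{q} \right) |v_{n_0}| \le |v_{n_0}| + \tfrac12 h^2 \|\mathcal{L}_h v\|_\infty;$$

hence

$$|v_{n_0}| \le \frac{1}{\underline{q}} \|\mathcal{L}_h v\|_\infty,$$

and (7.88) follows since by assumption $1/\underline{q} \le M$.

Case II: $\|v\|_\infty = |v_{n_0}|$, $n_0 = 0$ or $n_0 = N + 1$. In this case, (7.88) is trivial, since
$M \ge 1$. $\square$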


The method of finite differences now consists of replacing (7.80) and (7.82) by

$$(\mathcal{L}_h u)_n = r(x_n), \quad n = 1, 2, \ldots, N; \qquad u_0 = \alpha, \quad u_{N+1} = \beta. \tag{7.94}$$

In view of (7.89), this is the same as

$$a_n u_{n-1} + b_n u_n + c_n u_{n+1} = \tfrac12 h^2 r(x_n), \quad n = 1, 2, \ldots, N; \qquad u_0 = \alpha, \quad u_{N+1} = \beta,$$

and gives rise to the system of linear equations

$$\begin{bmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{N-1} & b_{N-1} & c_{N-1} \\ & & & a_N & b_N \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{bmatrix} = \frac12 h^2 \begin{bmatrix} r(x_1) \\ r(x_2) \\ \vdots \\ r(x_{N-1}) \\ r(x_N) \end{bmatrix} - \begin{bmatrix} a_1 \alpha \\ 0 \\ \vdots \\ 0 \\ c_N \beta \end{bmatrix}. \tag{7.95}$$

The matrix of the system is tridiagonal and strictly diagonally dominant, since
$|a_n| + |c_n| = 1$ and $b_n > 1$ by (7.92). In particular, it is nonsingular, so that
the system (7.95) has a unique solution. (Uniqueness follows also from the stability
of $\mathcal{L}_h$: the homogeneous system with $r(x_n) = \alpha = \beta = 0$ can have only the trivial
solution, since $\mathcal{L}_h u = 0$, $u_0 = u_{N+1} = 0$ implies $\|u\|_\infty = 0$ by (7.88).)

Now that we know that the difference method defines a unique approximation, 
the next question is: how good is it? An answer to that is given in the next two 
theorems. 


Theorem 7.3.2. If $h \bar{p} \le 2$, then

$$\|u - y\|_\infty \le M \|T_h y\|_\infty, \qquad M = \max(1, 1/\underline{q}), \tag{7.96}$$

where $u = \{u_n\}$ is the solution of (7.95), $y = \{y_n\}$ the grid function induced by the
exact solution $y(x)$ of (7.80) and (7.82), and $T_h y$ the grid function of the truncation
errors defined in (7.85), where $v = y$. If $y \in C^4[a, b]$, then

$$\|u - y\|_\infty \le \tfrac{1}{12} h^2 M \left( \|y^{(4)}\|_\infty + 2 \bar{p}\, \|y^{(3)}\|_\infty \right), \tag{7.97}$$

where $\|y^{(k)}\|_\infty = \max_{a \le x \le b} |y^{(k)}(x)|$, $k = 3, 4$.
Proof. From

$$(\mathcal{L}_h u)_n = r(x_n), \qquad u_0 = \alpha, \quad u_{N+1} = \beta,$$
$$(\mathcal{L} y)(x_n) = r(x_n), \qquad y(x_0) = \alpha, \quad y(x_{N+1}) = \beta,$$

we obtain, letting $v_n = u_n - y(x_n)$,

$$\begin{aligned}
(\mathcal{L}_h v)_n &= (\mathcal{L}_h u)_n - (\mathcal{L}_h y)_n \\
&= r(x_n) - [(\mathcal{L} y)(x_n) + (\mathcal{L}_h y)_n - (\mathcal{L} y)(x_n)] \\
&= r(x_n) - r(x_n) - (T_h y)_n \\
&= -(T_h y)_n,
\end{aligned}$$

so that

$$\|\mathcal{L}_h v\|_\infty = \|T_h y\|_\infty. \tag{7.98}$$

By Theorem 7.3.1, $\mathcal{L}_h$ is stable with the stability constant $M$ as defined in (7.96).
Since $v_0 = v_{N+1} = 0$, there follows $\|v\|_\infty \le M \|\mathcal{L}_h v\|_\infty$, which in view of (7.98)
and the definition of $v$ is (7.96). The second assertion (7.97) follows directly from
(7.86). $\square$
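As an illustration of the difference method and of the bound (7.97), the following Python sketch assembles the coefficients (7.90), solves the tridiagonal system (7.95) by elimination without pivoting (admissible here because of diagonal dominance), and compares the observed error with the bound. The test problem is an assumption chosen for convenience: $p \equiv q \equiv 1$, $r(x) = e^x$ on $[0, 1]$ with $\alpha = 1$, $\beta = e$, for which $y(x) = e^x$, $M = 1$, $\bar{p} = 1$, and all derivatives of $y$ are bounded by $e$. The function names are hypothetical.

```python
import math

def solve_bvp_fd(p, q, r, a, b, alpha, beta, N):
    """Solve (7.94) via the tridiagonal system (7.95); returns x_n and u_n, n = 0..N+1."""
    h = (b - a) / (N + 1)
    x = [a + n * h for n in range(N + 2)]
    lo = [-0.5 * (1 + 0.5 * h * p(x[n])) for n in range(1, N + 1)]   # a_n
    di = [1 + 0.5 * h * h * q(x[n]) for n in range(1, N + 1)]        # b_n
    up = [-0.5 * (1 - 0.5 * h * p(x[n])) for n in range(1, N + 1)]   # c_n
    rhs = [0.5 * h * h * r(x[n]) for n in range(1, N + 1)]
    rhs[0] -= lo[0] * alpha      # boundary contribution a_1 * alpha
    rhs[-1] -= up[-1] * beta     # boundary contribution c_N * beta
    # Forward elimination and back substitution (Thomas algorithm)
    for i in range(1, N):
        m = lo[i] / di[i - 1]
        di[i] -= m * up[i - 1]
        rhs[i] -= m * rhs[i - 1]
    u = [0.0] * N
    u[-1] = rhs[-1] / di[-1]
    for i in range(N - 2, -1, -1):
        u[i] = (rhs[i] - up[i] * u[i + 1]) / di[i]
    return x, [alpha] + u + [beta]

def max_error(N):
    """Maximum grid error for the assumed test problem with exact solution e^x."""
    x, u = solve_bvp_fd(lambda t: 1.0, lambda t: 1.0, math.exp,
                        0.0, 1.0, 1.0, math.e, N)
    return max(abs(ui - math.exp(xi)) for xi, ui in zip(x, u))

for N in (19, 39):
    h = 1.0 / (N + 1)
    bound = h * h / 12 * (math.e + 2 * math.e)   # right-hand side of (7.97)
    print(N, max_error(N), bound)
```

The observed error should lie below the bound and shrink by a factor of about 4 when $h$ is halved, in line with the $O(h^2)$ estimate.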


In the spirit of Chap. 5, Sect.5.7.3 and Chap. 6, Sect. 6.3.4, the result (7.97) of 
Theorem 7.3.2 can be refined as follows. 


Theorem 7.3.3. Let $p, q \in C^2[a, b]$, $y \in C^6[a, b]$, and $h \bar{p} \le 2$. Then

$$u_n - y(x_n) = h^2 e(x_n) + O(h^4), \quad n = 0, 1, \ldots, N + 1, \tag{7.99}$$

where $e(x)$ is the solution of

$$\mathcal{L} e = \theta(x), \quad a \le x \le b; \qquad e(a) = 0, \quad e(b) = 0, \tag{7.100}$$

with

$$\theta(x) = \tfrac{1}{12} \left[ y^{(4)}(x) - 2 p(x) y'''(x) \right]. \tag{7.101}$$

Proof. We first note that our assumptions are such that $\theta \in C^2[a, b]$, which, by
(7.100), implies that $e \in C^4[a, b]$.
Let

$$\hat{v}_n = \frac{1}{h^2} (u_n - y(x_n)).$$

We want to show that

$$\hat{v}_n = e(x_n) + O(h^2). \tag{7.102}$$

As in the proof of Theorem 7.3.2, we have

$$(\mathcal{L}_h \hat{v})_n = -\frac{1}{h^2} (T_h y)_n.$$

By (7.87) with $v = y$, this gives

$$(\mathcal{L}_h \hat{v})_n = \theta(x_n) + O(h^2). \tag{7.103}$$

Furthermore,

$$(\mathcal{L}_h e)_n = (\mathcal{L} e)(x_n) + (\mathcal{L}_h e)_n - (\mathcal{L} e)(x_n) = \theta(x_n) + (T_h e)_n, \tag{7.104}$$

by (7.100) and the definition (7.85) of truncation error. Since $e \in C^4[a, b]$, we have
from (7.86) that

$$(T_h e)_n = O(h^2). \tag{7.105}$$

Subtracting (7.104) from (7.103), therefore, yields

$$(\mathcal{L}_h v)_n = O(h^2), \quad \text{where } v_n = \hat{v}_n - e(x_n).$$

Since $v_0 = v_{N+1} = 0$ and $\mathcal{L}_h$ is stable, by assumption, there follows from the
stability inequality that $|v_n| \le M \|\mathcal{L}_h v\|_\infty = O(h^2)$, which is (7.102). $\square$


Theorem 7.3.3 could be used as a basis for Richardson extrapolation (cf. Chap. 3, 
Sect. 3.2.7). Another application is the method of difference correction due to L. Fox. 
A “difference correction” is any quantity $E_n$ such that
$$E_n = e(x_n) + O(h^2), \quad n = 1, 2, \ldots, N. \tag{7.106}$$
It then follows from (7.99) that
$$u_n - h^2 E_n = y(x_n) + O(h^4); \tag{7.107}$$
that is, $\hat{u}_n = u_n - h^2 E_n$ is an improved approximation having order of accuracy $O(h^4)$. Fox’s idea is to construct a difference correction $E_n$ by applying the basic difference method to the boundary value problem (7.100) in which $\theta(x_n)$ is replaced by a suitable difference approximation $\Theta_n$:
$$(\mathcal{L}_h E)_n = \Theta_n, \quad n = 1, 2, \ldots, N; \quad E_0 = 0, \quad E_{N+1} = 0. \tag{7.108}$$
Letting $v_n = E_n - e(x_n)$, we then find
$$(\mathcal{L}_h v)_n = (\mathcal{L}_h E)_n - (\mathcal{L}_h e)_n = \Theta_n - \theta(x_n) + O(h^2),$$
by virtue of (7.108), (7.104), and (7.105). Since $v_0 = v_{N+1} = 0$, stability then yields
$$|v_n| = |E_n - e(x_n)| \le M\|\Theta - \theta\|_\infty + O(h^2),$$
so that for (7.106) to hold, all we need is to make sure that
$$\Theta_n - \theta(x_n) = O(h^2), \quad n = 1, 2, \ldots, N. \tag{7.109}$$


This can be achieved by replacing the derivatives on the right of (7.101) by suitable 
finite difference approximations (see Ex. 10). 
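
The expansion (7.99) can be observed numerically through Richardson extrapolation. The following is a minimal sketch, not taken from the text: it assumes the hypothetical test problem $-y'' + y = (\pi^2+1)\sin \pi x$ on $[0,1]$ with $y(0) = y(1) = 0$ (so $p = 0$, $q = 1$, exact solution $y = \sin \pi x$), and the helper names `solve_tridiagonal`, `fd_solve`, and `max_error` are illustrative only. Solving with step $h$ and $h/2$ and forming $(4u_{h/2} - u_h)/3$ at the coarse grid points should cancel the $h^2 e(x_n)$ term.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; row i reads
    lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i]."""
    n = len(diag)
    d, b = list(diag), list(rhs)
    for i in range(1, n):                      # forward elimination
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x

def fd_solve(N):
    """Central differences for -y'' + y = r, y(0) = y(1) = 0, N interior points."""
    h = 1.0 / (N + 1)
    x = [n * h for n in range(1, N + 1)]
    off = [-1.0 / h**2] * N
    diag = [2.0 / h**2 + 1.0] * N
    rhs = [(math.pi**2 + 1.0) * math.sin(math.pi * xn) for xn in x]
    return x, solve_tridiagonal(off, diag, off, rhs)

def max_error(x, u):
    """Maximum nodal error against the assumed exact solution sin(pi x)."""
    return max(abs(un - math.sin(math.pi * xn)) for xn, un in zip(x, u))

N = 15                                   # coarse grid, h = 1/16
x_h, u_h = fd_solve(N)
x_half, u_half = fd_solve(2 * N + 1)     # fine grid, h/2 = 1/32
u_half_on_coarse = u_half[1::2]          # every second fine node is a coarse node
u_rich = [(4.0 * uf - uc) / 3.0 for uf, uc in zip(u_half_on_coarse, u_h)]

err_h = max_error(x_h, u_h)
err_half = max_error(x_half, u_half)
err_rich = max_error(x_h, u_rich)
print(err_h, err_half, err_rich)
```

Halving $h$ should reduce the error by a factor close to 4, while the extrapolated values should be markedly more accurate than either finite difference solution, in line with the $O(h^4)$ remainder in (7.99).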

7.3.2 Nonlinear Second-Order Equations


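A natural nonlinear extension of the linear problem (7.80) and (7.82) is
$$\mathcal{K}y = 0, \quad y(a) = \alpha, \quad y(b) = \beta, \tag{7.110}$$
where
$$\mathcal{K}y := -y'' + f(x, y, y'), \tag{7.111}$$
and $f(x,y,z)$ is a given function of class $C^1$ defined on $[a,b] \times \mathbb{R} \times \mathbb{R}$, now assumed to be nonlinear in $y$ and/or $z$. In analogy to (7.83) we make the assumption that
$$|f_z| \le \bar{p}, \quad 0 < \underline{q} \le f_y \le \bar{q} \quad \text{on } [a,b] \times \mathbb{R} \times \mathbb{R}. \tag{7.112}$$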


Then, by Theorem 7.1.2, the problem (7.110) has a unique solution. 
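We use again the simplest difference approximation $\mathcal{K}_h$ to $\mathcal{K}$,
$$(\mathcal{K}_h u)_n = -\frac{u_{n+1} - 2u_n + u_{n-1}}{h^2} + f\!\left(x_n, u_n, \frac{u_{n+1} - u_{n-1}}{2h}\right), \tag{7.113}$$
and define the truncation error as before by
$$(T_h v)_n = (\mathcal{K}_h v)_n - (\mathcal{K}v)(x_n), \quad n = 1, 2, \ldots, N, \tag{7.114}$$
for any smooth function $v$ on $[a,b]$. If $v \in C^4[a,b]$, then by Taylor’s theorem, applied at $x = x_n$, $y = v(x_n)$, $z = v'(x_n)$,
$$\begin{aligned}
(T_h v)_n &= -\left[\frac{v(x_n + h) - 2v(x_n) + v(x_n - h)}{h^2} - v''(x_n)\right] \\
&\quad + f\!\left(x_n, v(x_n), \frac{v(x_n + h) - v(x_n - h)}{2h}\right) - f(x_n, v(x_n), v'(x_n)) \\
&= -\frac{h^2}{12}\,v^{(4)}(\xi_1) + f_z(x_n, v(x_n), \bar{z}_n)\left[\frac{v(x_n + h) - v(x_n - h)}{2h} - v'(x_n)\right] \\
&= -\frac{h^2}{12}\,v^{(4)}(\xi_1) + f_z(x_n, v(x_n), \bar{z}_n)\,\frac{h^2}{6}\,v'''(\xi_2),
\end{aligned}$$
where $\xi_i \in [x_n - h, x_n + h]$, $i = 1, 2$, and $\bar{z}_n$ is between $v'(x_n)$ and $(2h)^{-1}[v(x_n + h) - v(x_n - h)]$. Thus,
$$(T_h v)_n = -\frac{h^2}{12}\left[v^{(4)}(\xi_1) - 2 f_z(x_n, v(x_n), \bar{z}_n)\,v'''(\xi_2)\right]. \tag{7.115}$$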
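
The $O(h^2)$ behavior in (7.115) is easy to check numerically. The sketch below is an illustration under stated assumptions, not part of the text: the nonlinearity $f(x,y,z) = y + \sin z$ and the smooth function $v(x) = \sin x$ are hypothetical choices, and the function names are illustrative. It evaluates the truncation error (7.114) at one point for two step sizes.

```python
import math

def f(x, y, z):
    # hypothetical nonlinearity with f_y = 1 and |f_z| <= 1
    return y + math.sin(z)

def v(x):
    return math.sin(x)

def dv(x):
    return math.cos(x)

def d2v(x):
    return -math.sin(x)

def K(x):
    """Differential operator (7.111) applied to v at x."""
    return -d2v(x) + f(x, v(x), dv(x))

def K_h(x, h):
    """Difference operator (7.113) applied to the grid values of v."""
    second = (v(x + h) - 2.0 * v(x) + v(x - h)) / h**2
    first = (v(x + h) - v(x - h)) / (2.0 * h)
    return -second + f(x, v(x), first)

def truncation_error(x, h):
    """Truncation error (7.114) at the point x."""
    return K_h(x, h) - K(x)

x0 = 0.7
t_coarse = truncation_error(x0, 0.1)
t_fine = truncation_error(x0, 0.05)
ratio = t_coarse / t_fine
print(t_coarse, t_fine, ratio)
```

If the truncation error is indeed proportional to $h^2$ to leading order, halving $h$ should give a ratio near 4.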
Since $\mathcal{K}_h$ is nonlinear, the definition of stability needs to be slightly modified. We call $\mathcal{K}_h$ stable if for $h$ sufficiently small, and for any two grid functions $v = \{v_n\}$, $w = \{w_n\}$, there is a constant $M$ such that
$$\|v - w\|_\infty \le M\left\{\max(|v_0 - w_0|, |v_{N+1} - w_{N+1}|) + \|\mathcal{K}_h v - \mathcal{K}_h w\|_\infty\right\}, \quad v, w \in \Gamma_h[a,b]. \tag{7.116}$$
If $\mathcal{K}_h$ is linear, this reduces to the previous definition (7.88), since $v - w$, just like $v$, is an arbitrary grid function.


Theorem 7.3.4. If $h\bar{p} \le 2$, then $\mathcal{K}_h$ is stable. Indeed, (7.116) holds with $M = \max(1, 1/\underline{q})$. (Here, $\bar{p}$, $\underline{q}$ are the constants defined in (7.112).)


Proof. Much the same as the proof of Theorem 7.3.1. See Ex. 11. □


The method of finite differences now takes on the following form:
$$(\mathcal{K}_h u)_n = 0, \quad n = 1, 2, \ldots, N; \quad u_0 = \alpha, \quad u_{N+1} = \beta. \tag{7.117}$$
This is a system of $N$ nonlinear equations in the $N$ unknowns $u_1, u_2, \ldots, u_N$. We show shortly that under the assumptions made, the system (7.117) has a unique solution. Its error can be estimated exactly as in Theorem 7.3.2. Indeed, by applying the stability inequality (7.116) with $v = u$ and $w = y$, where $u$ is the grid function satisfying (7.117) and $y$ the grid function induced by the exact solution $y(x)$ of (7.110), we get, with $M$ as defined in Theorem 7.3.4,
$$\begin{aligned}
\|u - y\|_\infty &\le M\|\mathcal{K}_h u - \mathcal{K}_h y\|_\infty = M\|\mathcal{K}_h y\|_\infty \\
&= M\|\mathcal{K}y + (\mathcal{K}_h y - \mathcal{K}y)\|_\infty \\
&= M\|\mathcal{K}_h y - \mathcal{K}y\|_\infty \\
&= M\|T_h y\|_\infty,
\end{aligned}$$
which is (7.96). The same error estimate as in (7.97) then again follows immediately from (7.115) and the first assumption in (7.112).

In order to show that (7.117) has a unique solution, we write the system in fixed point form and apply the contraction mapping principle. It is convenient to introduce a parameter $\omega$ in the process (a “relaxation parameter,” as it were) by writing (7.117) equivalently in the form
$$u = g(u), \quad g(u) := u - \frac{1}{1+\omega}\,\frac{1}{2}h^2\,\mathcal{K}_h u \quad (\omega \ne -1), \qquad u_0 = \alpha, \quad u_{N+1} = \beta. \tag{7.118}$$
Here we think of $g$ as a mapping $\mathbb{R}^{N+2} \to \mathbb{R}^{N+2}$, by defining $g_0(u) = \alpha$, $g_{N+1}(u) = \beta$. We want to show that $g$ is a contraction map on $\mathbb{R}^{N+2}$ if $h$ satisfies the condition of Theorem 7.3.4 and $\omega$ is suitably chosen. This then will prove (cf. Theorem 4.9.1) the existence and uniqueness of the solution of (7.118), and hence of (7.117).

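Given any two grid functions $v = \{v_n\}$, $w = \{w_n\}$, we can write by Taylor’s theorem, after a simple calculation,
$$g_n(v) - g_n(w) = \frac{1}{1+\omega}\left[a_n(v_{n-1} - w_{n-1}) + (1 + \omega - b_n)(v_n - w_n) + c_n(v_{n+1} - w_{n+1})\right], \quad 1 \le n \le N, \tag{7.119}$$
whereas for $n = 0$ or $n = N + 1$ the difference on the left, of course, is zero; here
$$\begin{aligned}
a_n &= \tfrac{1}{2}\left[1 + \tfrac{1}{2}h f_z(x_n, \bar{y}_n, \bar{z}_n)\right], \\
b_n &= 1 + \tfrac{1}{2}h^2 f_y(x_n, \bar{y}_n, \bar{z}_n), \\
c_n &= \tfrac{1}{2}\left[1 - \tfrac{1}{2}h f_z(x_n, \bar{y}_n, \bar{z}_n)\right],
\end{aligned}$$
with $\bar{y}_n$, $\bar{z}_n$ suitable intermediate values. Since $h\bar{p} \le 2$, we have
$$a_n \ge 0, \quad c_n \ge 0, \quad a_n + c_n = 1. \tag{7.120}$$
Assuming, furthermore, that
$$\omega \ge \tfrac{1}{2}h^2\bar{q}, \tag{7.121}$$
we have
$$1 + \omega - b_n \ge 1 + \omega - \left(1 + \tfrac{1}{2}h^2\bar{q}\right) = \omega - \tfrac{1}{2}h^2\bar{q} \ge 0.$$
Hence, all coefficients on the right of (7.119) are nonnegative, and since
$$0 \le 1 + \omega - b_n \le 1 + \omega - \left(1 + \tfrac{1}{2}h^2\underline{q}\right) = \omega - \tfrac{1}{2}h^2\underline{q},$$
we obtain, upon taking norms and noting (7.120),
$$|g_n(v) - g_n(w)| \le \frac{1}{1+\omega}\left(a_n + \omega - \tfrac{1}{2}h^2\underline{q} + c_n\right)\|v - w\|_\infty = \frac{1}{1+\omega}\left(1 + \omega - \tfrac{1}{2}h^2\underline{q}\right)\|v - w\|_\infty;$$
that is,
$$\|g(v) - g(w)\|_\infty \le \gamma(\omega)\,\|v - w\|_\infty, \quad \gamma(\omega) := 1 - \frac{\tfrac{1}{2}h^2\underline{q}}{1+\omega}. \tag{7.122}$$
Clearly, $\gamma(\omega) < 1$, showing that $g$ is a contraction map on $\mathbb{R}^{N+2}$, as claimed.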
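
The contraction estimate (7.122) can be watched in action with a small experiment. The following sketch is only an illustration: it assumes the hypothetical nonlinearity $f(x,y,z) = y + \tfrac12 \sin z + x$ on $[0,1]$ (so that $\underline{q} = \bar{q} = 1$ and $\bar{p} = \tfrac12$), boundary values $\alpha = 0$, $\beta = 1$, and the smallest $\omega$ allowed by (7.121). It iterates $u \leftarrow g(u)$ from (7.118) and records the largest observed ratio of successive step sizes in the maximum norm, which by (7.122) should not exceed $\gamma(\omega)$.

```python
import math

def f(x, y, z):
    # hypothetical test nonlinearity: f_y = 1, |f_z| <= 0.5
    return y + 0.5 * math.sin(z) + x

a, b, alpha, beta = 0.0, 1.0, 0.0, 1.0
N = 9
h = (b - a) / (N + 1)
x = [a + n * h for n in range(N + 2)]
q_lower = q_upper = 1.0
omega = 0.5 * h**2 * q_upper                         # condition (7.121)
gamma = 1.0 - 0.5 * h**2 * q_lower / (1.0 + omega)   # bound (7.122)

def K_h(u, n):
    """Difference operator (7.113) at the interior node n."""
    second = (u[n + 1] - 2.0 * u[n] + u[n - 1]) / h**2
    first = (u[n + 1] - u[n - 1]) / (2.0 * h)
    return -second + f(x[n], u[n], first)

def g(u):
    """Fixed point map (7.118); boundary components stay alpha and beta."""
    inner = [u[n] - 0.5 * h**2 * K_h(u, n) / (1.0 + omega)
             for n in range(1, N + 1)]
    return [alpha] + inner + [beta]

def sup_norm_diff(v, w):
    return max(abs(vi - wi) for vi, wi in zip(v, w))

# start from the straight line joining the boundary values
u = [alpha + (beta - alpha) * (xn - a) / (b - a) for xn in x]
worst_ratio = 0.0
prev_step = None
for k in range(20000):
    u_next = g(u)
    step = sup_norm_diff(u_next, u)
    # track ratios only while steps are well above rounding level
    if prev_step is not None and prev_step > 1e-8:
        worst_ratio = max(worst_ratio, step / prev_step)
    u, prev_step = u_next, step
    if step < 1e-13:
        break

residual = max(abs(K_h(u, n)) for n in range(1, N + 1))
print(gamma, worst_ratio, residual)
```

Because $\gamma(\omega) = 1 - O(h^2)$, the guaranteed rate deteriorates as the grid is refined, which is one reason to prefer Newton's method in practice.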


In principle, one could apply the method of successive iteration to (7.118), and 
it would converge for arbitrary initial approximation. Faster convergence can be 
expected, however, by applying Newton’s method directly on (7.117); for details, 
see Ex. 12. 
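
As a rough illustration of the Newton approach, the sketch below applies Newton's method to (7.117), using the fact that the Jacobian of $\mathcal{K}_h$ is tridiagonal. It is one possible arrangement under stated assumptions, not the book's worked solution: the nonlinearity $f(x,y,z) = y + \tfrac12\sin z + x$, its partial derivatives, the boundary data, and the names `newton_fd` and `solve_tridiagonal` are all hypothetical choices made for the example.

```python
import math

def f(x, y, z):
    return y + 0.5 * math.sin(z) + x     # hypothetical test nonlinearity

def f_y(x, y, z):
    return 1.0                           # partial derivative of f in y

def f_z(x, y, z):
    return 0.5 * math.cos(z)             # partial derivative of f in z

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; row i reads
    lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i]."""
    n = len(diag)
    d, b = list(diag), list(rhs)
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    sol = [0.0] * n
    sol[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        sol[i] = (b[i] - upper[i] * sol[i + 1]) / d[i]
    return sol

def newton_fd(a, b, alpha, beta, N, tol=1e-10, max_iter=50):
    """Newton's method for (K_h u)_n = 0, n = 1..N, u_0 = alpha, u_{N+1} = beta."""
    h = (b - a) / (N + 1)
    x = [a + n * h for n in range(N + 2)]
    u = [alpha + (beta - alpha) * (xn - a) / (b - a) for xn in x]
    for it in range(1, max_iter + 1):
        lower, diag, upper, res = [], [], [], []
        for n in range(1, N + 1):
            z = (u[n + 1] - u[n - 1]) / (2.0 * h)
            res.append(-(u[n + 1] - 2.0 * u[n] + u[n - 1]) / h**2
                       + f(x[n], u[n], z))
            fz = f_z(x[n], u[n], z)
            lower.append(-1.0 / h**2 - fz / (2.0 * h))   # d(K_h u)_n / d u_{n-1}
            diag.append(2.0 / h**2 + f_y(x[n], u[n], z)) # d(K_h u)_n / d u_n
            upper.append(-1.0 / h**2 + fz / (2.0 * h))   # d(K_h u)_n / d u_{n+1}
        delta = solve_tridiagonal(lower, diag, upper, [-r for r in res])
        for n in range(1, N + 1):
            u[n] += delta[n - 1]
        if max(abs(d) for d in delta) < tol:
            return x, u, it
    raise RuntimeError("Newton iteration did not converge")

x, u, iterations = newton_fd(0.0, 1.0, 0.0, 1.0, N=49)
h = x[1] - x[0]
residual = max(
    abs(-(u[n + 1] - 2.0 * u[n] + u[n - 1]) / h**2
        + f(x[n], u[n], (u[n + 1] - u[n - 1]) / (2.0 * h)))
    for n in range(1, len(x) - 1))
print(iterations, residual)
```

Each Newton step costs one tridiagonal solve, so the work per iteration is $O(N)$, the same order as one sweep of the fixed point iteration.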


7.4 Variational Methods 


Variational methods take advantage of the fact that the solution of important types 
of boundary value problems satisfies certain extremal properties. This then suggests 
solving the respective extremal problems — at least approximately — in place of the 
boundary value problems. This can be done by classical methods. We illustrate the 
method for a linear second-order boundary value problem with simplified (Dirichlet) 
boundary conditions. 


7.4.1 Variational Formulation 


Without restriction of generality (cf. (7.35)), we can assume that the problem is in self-adjoint form:
$$\mathcal{L}y = r(x), \quad a \le x \le b; \quad y(a) = \alpha, \quad y(b) = \beta, \tag{7.123}$$
where
$$\mathcal{L}y := -\frac{d}{dx}\left(p(x)\frac{dy}{dx}\right) + q(x)\,y, \quad a \le x \le b. \tag{7.124}$$
We assume $p \in C^1[a,b]$ and $q$, $r$ continuous on $[a,b]$, and
$$p(x) \ge \underline{p} > 0, \quad q(x) \ge 0 \quad \text{on } [a,b]. \tag{7.125}$$
Under these assumptions, the problem (7.123) has a unique solution (cf. Corollary 7.1.1).

If $\ell(x)$ is a linear function having the same boundary values as $y$ in (7.123), then $z(x) = y(x) - \ell(x)$ satisfies $\mathcal{L}z = r(x) - (\mathcal{L}\ell)(x)$, $z(a) = z(b) = 0$, which is a problem of the same type as (7.123), but with homogeneous boundary conditions. We may assume, therefore, that $\alpha = \beta = 0$, and thus consider
$$\mathcal{L}y = r(x), \quad a \le x \le b; \quad y(a) = y(b) = 0. \tag{7.126}$$


Denoting by $C_0^2[a,b]$ the linear space $C_0^2[a,b] = \{u \in C^2[a,b] : u(a) = u(b) = 0\}$, we may write (7.126) in operator form as
$$\mathcal{L}y = r, \quad y \in C_0^2[a,b]. \tag{7.127}$$
Note that $\mathcal{L} : C_0^2[a,b] \to C[a,b]$ is a linear operator. It is convenient to enlarge the space $C_0^2$ somewhat and define
$$V_0 = \{v \in C[a,b] : v' \text{ piecewise continuous and bounded on } [a,b],\ v(a) = v(b) = 0\}.$$
On $V_0$ we can define the usual inner product
$$(u, v) := \int_a^b u(x)\,v(x)\,dx, \quad u, v \in V_0. \tag{7.128}$$


a 


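Theorem 7.4.1. The operator $\mathcal{L}$ in (7.124) is symmetric on $C_0^2[a,b]$ relative to the inner product (7.128); that is,
$$(\mathcal{L}u, v) = (u, \mathcal{L}v), \quad \text{all } u, v \in C_0^2[a,b]. \tag{7.129}$$


Proof. Use integration by parts to obtain
$$\begin{aligned}
(\mathcal{L}u, v) &= \int_a^b \left[-(p(x)u'(x))' + q(x)u(x)\right] v(x)\,dx \\
&= -(p\,u'v)\Big|_a^b + \int_a^b \left[p(x)u'(x)v'(x) + q(x)u(x)v(x)\right]dx \\
&= \int_a^b \left[p(x)u'(x)v'(x) + q(x)u(x)v(x)\right]dx.
\end{aligned}$$
Since the last integral is symmetric in $u$ and $v$, it is also equal to $(\mathcal{L}v, u)$, which in turn, by the symmetry of $(\cdot\,,\cdot)$, proves the theorem. □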


Note that the last integral in the proof of Theorem 7.4.1 is defined not only on $C_0^2[a,b]$, but also on $V_0$. It suggests an alternative inner product,
$$[u, v] := \int_a^b \left[p(x)u'(x)v'(x) + q(x)u(x)v(x)\right]dx, \quad u, v \in V_0, \tag{7.130}$$
and the proof of Theorem 7.4.1 shows that
$$(\mathcal{L}u, v) = [u, v] \quad \text{if } u \in C_0^2[a,b],\ v \in V_0. \tag{7.131}$$
In particular, if $u = y$ is a solution of (7.126), then
$$[y, v] = (r, v), \quad \text{all } v \in V_0; \tag{7.132}$$
this is the variational, or weak, form of (7.126).


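Theorem 7.4.2. Under the assumptions made on $p$, $q$, and $r$ (cf. (7.125)), there exist positive constants $\underline{c}$ and $\bar{c}$ such that
$$\underline{c}\,\|u\|_\infty^2 \le [u, u] \le \bar{c}\,\|u'\|_\infty^2, \quad \text{all } u \in V_0. \tag{7.133}$$
In fact,
$$\underline{c} = \frac{\underline{p}}{b - a}, \quad \bar{c} = (b - a)\|p\|_\infty + (b - a)^3\|q\|_\infty. \tag{7.134}$$


Proof. For any $u \in V_0$, since $u(a) = 0$, we have
$$u(x) = \int_a^x u'(t)\,dt, \quad x \in [a,b].$$
By Schwarz’s inequality,
$$u^2(x) \le \int_a^x 1\,dt \int_a^x [u'(t)]^2\,dt \le (b - a)\int_a^b [u'(t)]^2\,dt, \quad x \in [a,b],$$
and, therefore,
$$\|u\|_\infty^2 \le (b - a)\int_a^b [u'(t)]^2\,dt \le (b - a)^2\,\|u'\|_\infty^2. \tag{7.135}$$
Using the assumption (7.125), we get
$$[u, u] = \int_a^b \left(p(x)[u'(x)]^2 + q(x)[u(x)]^2\right)dx \ge \underline{p}\int_a^b [u'(x)]^2\,dx \ge \frac{\underline{p}}{b - a}\,\|u\|_\infty^2,$$
where the last inequality follows from the left inequality in (7.135). This proves the lower bound in (7.133). The upper bound is obtained by observing that
$$[u, u] \le (b - a)\|p\|_\infty\|u'\|_\infty^2 + (b - a)\|q\|_\infty\|u\|_\infty^2 \le \bar{c}\,\|u'\|_\infty^2,$$
where (7.135) has been used in the last step. □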
We remark that (7.133) implies the uniqueness of solutions of (7.126). In fact, if
$$\mathcal{L}y = r, \quad \mathcal{L}y^* = r, \quad y, y^* \in C_0^2[a,b],$$
then $\mathcal{L}(y - y^*) = 0$, hence, by (7.131) and (7.133),
$$0 = (\mathcal{L}(y - y^*), y - y^*) = [y - y^*, y - y^*] \ge \underline{c}\,\|y - y^*\|_\infty^2,$$
and it follows that $y \equiv y^*$.


7.4.2 The Extremal Problem 


We define the quadratic functional
$$F(u) := [u, u] - 2(r, u), \quad u \in V_0, \tag{7.136}$$
where $r$ is the right-hand function in (7.126). The extremal property for the solution $y$ of (7.127) is expressed in the following theorem.


Theorem 7.4.3. Let $y$ be the solution of (7.127). Then
$$F(u) > F(y), \quad \text{all } u \in V_0,\ u \not\equiv y. \tag{7.137}$$


Proof. By (7.132), $(r, u) = [y, u]$, so that
$$\begin{aligned}
F(u) &= [u, u] - 2(r, u) = [u, u] - 2[y, u] + [y, y] - [y, y] \\
&= [y - u, y - u] - [y, y] > -[y, y],
\end{aligned}$$
where strict inequality holds in view of (7.133) and $y - u \not\equiv 0$. On the other hand, since $[y, y] = (\mathcal{L}y, y) = (r, y)$, by (7.131), we have
$$F(y) = [y, y] - 2(r, y) = (r, y) - 2(r, y) = -(r, y) = -[y, y],$$
which, combined with the previous inequality, proves the theorem. □


Theorem 7.4.3 thus expresses the following extremal property of the solution of (7.127):
$$F(y) = \min_{u \in V_0} F(u). \tag{7.138}$$
We view (7.138) as an extremal problem for determining $y$, and in the next section solve it approximately by determining a function $u_S$ from a finite-dimensional subset $S \subset V_0$ that minimizes $F(u)$ on $S$. In this connection, it is useful to note the identity
$$[y - u, y - u] = F(u) + [y, y], \quad u \in V_0, \tag{7.139}$$
satisfied by the solution $y$, which was established in the course of the proof of Theorem 7.4.3.
Let $S \subset V_0$ be a finite-dimensional subspace of $V_0$ and $\dim S = n$ its dimension. Let $u_1, u_2, \ldots, u_n$ be a basis of $S$, so that
$$u \in S \quad \text{if and only if} \quad u = \sum_{\nu=1}^n \xi_\nu u_\nu, \quad \xi_\nu \in \mathbb{R}. \tag{7.140}$$
We approximate the solution $y$ of (7.138) by $u_S \in S$, which satisfies
$$F(u_S) = \min_{u \in S} F(u). \tag{7.141}$$


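Before we analyze the quality of the approximation $u_S \approx y$, let us explain the mechanics of the method.
We have, for any $u \in S$,
$$\begin{aligned}
F(u) &= \left[\sum_{\nu=1}^n \xi_\nu u_\nu, \sum_{\mu=1}^n \xi_\mu u_\mu\right] - 2\left(r, \sum_{\nu=1}^n \xi_\nu u_\nu\right) \\
&= \sum_{\nu,\mu=1}^n [u_\nu, u_\mu]\,\xi_\nu \xi_\mu - 2\sum_{\nu=1}^n (r, u_\nu)\,\xi_\nu.
\end{aligned}$$
Define
$$\boldsymbol{U} = \begin{bmatrix}
[u_1, u_1] & [u_1, u_2] & \cdots & [u_1, u_n] \\
[u_2, u_1] & [u_2, u_2] & \cdots & [u_2, u_n] \\
\vdots & \vdots & & \vdots \\
[u_n, u_1] & [u_n, u_2] & \cdots & [u_n, u_n]
\end{bmatrix}, \quad
\boldsymbol{\xi} = \begin{bmatrix} \xi_1 \\ \xi_2 \\ \vdots \\ \xi_n \end{bmatrix}, \quad
\boldsymbol{\rho} = \begin{bmatrix} (r, u_1) \\ (r, u_2) \\ \vdots \\ (r, u_n) \end{bmatrix}. \tag{7.142}$$
In early applications of the method to structural mechanics, and ever since, the matrix $\boldsymbol{U}$ is called the stiffness matrix, and $\boldsymbol{\rho}$ the load vector. In terms of these, the functional $F$ can be written in matrix form as
$$F(u) = \boldsymbol{\xi}^T \boldsymbol{U} \boldsymbol{\xi} - 2\boldsymbol{\rho}^T \boldsymbol{\xi}, \quad \boldsymbol{\xi} \in \mathbb{R}^n. \tag{7.143}$$
The matrix $\boldsymbol{U}$ is not only symmetric, but also positive definite, since $\boldsymbol{\xi}^T \boldsymbol{U} \boldsymbol{\xi} = [u, u] > 0$, unless $u \equiv 0$ (i.e., $\boldsymbol{\xi} = \boldsymbol{0}$).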
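Our approximate extremal problem (7.141) thus takes the form
$$\phi(\boldsymbol{\xi}) = \min, \quad \phi(\boldsymbol{\xi}) := \boldsymbol{\xi}^T \boldsymbol{U} \boldsymbol{\xi} - 2\boldsymbol{\rho}^T \boldsymbol{\xi}, \quad \boldsymbol{\xi} \in \mathbb{R}^n, \tag{7.144}$$
an unconstrained quadratic minimization problem in $\mathbb{R}^n$. Since $\boldsymbol{U}$ is positive definite, the problem (7.144) has a unique solution $\hat{\boldsymbol{\xi}}$ given by the solution of the linear system
$$\boldsymbol{U}\boldsymbol{\xi} = \boldsymbol{\rho}. \tag{7.145}$$
It is easily verified that
$$\phi(\boldsymbol{\xi}) > \phi(\hat{\boldsymbol{\xi}}), \quad \text{all } \boldsymbol{\xi} \in \mathbb{R}^n,\ \boldsymbol{\xi} \ne \hat{\boldsymbol{\xi}}; \tag{7.146}$$
indeed, since $\boldsymbol{\rho} = \boldsymbol{U}\hat{\boldsymbol{\xi}}$, and thus $\boldsymbol{\rho}^T = \hat{\boldsymbol{\xi}}^T \boldsymbol{U}^T = \hat{\boldsymbol{\xi}}^T \boldsymbol{U}$, we have
$$\begin{aligned}
\phi(\boldsymbol{\xi}) &= \boldsymbol{\xi}^T \boldsymbol{U} \boldsymbol{\xi} - 2\boldsymbol{\rho}^T \boldsymbol{\xi} = \boldsymbol{\xi}^T \boldsymbol{U} \boldsymbol{\xi} - 2\hat{\boldsymbol{\xi}}^T \boldsymbol{U} \boldsymbol{\xi} \\
&= \boldsymbol{\xi}^T \boldsymbol{U} \boldsymbol{\xi} - 2\hat{\boldsymbol{\xi}}^T \boldsymbol{U} \boldsymbol{\xi} + \hat{\boldsymbol{\xi}}^T \boldsymbol{U} \hat{\boldsymbol{\xi}} - \hat{\boldsymbol{\xi}}^T \boldsymbol{U} \hat{\boldsymbol{\xi}} \\
&= (\boldsymbol{\xi} - \hat{\boldsymbol{\xi}})^T \boldsymbol{U} (\boldsymbol{\xi} - \hat{\boldsymbol{\xi}}) + \phi(\hat{\boldsymbol{\xi}}),
\end{aligned}$$
where $-\hat{\boldsymbol{\xi}}^T \boldsymbol{U} \hat{\boldsymbol{\xi}} = -\hat{\boldsymbol{\xi}}^T \boldsymbol{\rho} = \hat{\boldsymbol{\xi}}^T \boldsymbol{\rho} - 2\boldsymbol{\rho}^T \hat{\boldsymbol{\xi}} = \hat{\boldsymbol{\xi}}^T \boldsymbol{U} \hat{\boldsymbol{\xi}} - 2\boldsymbol{\rho}^T \hat{\boldsymbol{\xi}} = \phi(\hat{\boldsymbol{\xi}})$ has been used in the last step. From this, (7.146) follows immediately. Thus,
$$u_S = \sum_{\nu=1}^n \hat{\xi}_\nu u_\nu, \quad \hat{\boldsymbol{\xi}} = [\hat{\xi}_1, \hat{\xi}_2, \ldots, \hat{\xi}_n]^T \ \text{ where } \ \boldsymbol{U}\hat{\boldsymbol{\xi}} = \boldsymbol{\rho}. \tag{7.147}$$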


v=1 


In practice, the basis functions of S' are chosen to have small support, which 
results in a matrix U having a band structure. 
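
To make the mechanics concrete, here is a minimal sketch of the method for a basis with small support. Everything specific in it is an assumption made for illustration: the test problem $-y'' + y = (\pi^2 + 1)\sin \pi x$ on $[0,1]$ (constant $p = q = 1$, exact solution $\sin \pi x$), a uniform grid, piecewise linear hat functions as the basis $u_\nu$, and Simpson's rule on each subinterval for the load integrals. With hat functions the stiffness matrix is tridiagonal, with entries $2p/h + 2qh/3$ on the diagonal and $-p/h + qh/6$ next to it.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; row i reads
    lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i]."""
    n = len(diag)
    d, b = list(diag), list(rhs)
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    sol = [0.0] * n
    sol[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        sol[i] = (b[i] - upper[i] * sol[i + 1]) / d[i]
    return sol

def r(x):
    # right-hand side for the assumed exact solution y = sin(pi x)
    return (math.pi**2 + 1.0) * math.sin(math.pi * x)

p, q = 1.0, 1.0
n = 19                                   # number of hat functions
h = 1.0 / (n + 1)
nodes = [nu * h for nu in range(1, n + 1)]

# stiffness matrix entries [u_nu, u_mu] for hat functions (exact integrals)
diag = [2.0 * p / h + 2.0 * q * h / 3.0] * n
off = [-p / h + q * h / 6.0] * n
# load vector (r, u_nu), Simpson's rule on each of the two support intervals
rho = [(h / 3.0) * (r(x - 0.5 * h) + r(x) + r(x + 0.5 * h)) for x in nodes]

xi_hat = solve_tridiagonal(off, diag, off, rho)   # solves U xi = rho, cf. (7.145)

def phi(xi):
    """Quadratic functional (7.144) for the tridiagonal matrix above."""
    quad = sum(diag[i] * xi[i]**2 for i in range(n))
    quad += 2.0 * sum(off[i] * xi[i] * xi[i + 1] for i in range(n - 1))
    return quad - 2.0 * sum(rho[i] * xi[i] for i in range(n))

# for hat functions, xi_hat[nu] is the value of u_S at the node x_nu
nodal_error = max(abs(c - math.sin(math.pi * x)) for c, x in zip(xi_hat, nodes))
perturbed = [c + 0.01 * (-1) ** i for i, c in enumerate(xi_hat)]
print(nodal_error, phi(xi_hat), phi(perturbed))
```

Any perturbation of the computed coefficient vector should increase $\phi$, in agreement with (7.146).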

It is now straightforward to establish the optimal approximation property of $u_S$ in the norm $[\cdot\,,\cdot]$; that is,
$$[y - u_S, y - u_S] = \min_{u \in S}\,[y - u, y - u]. \tag{7.148}$$
Indeed, by (7.139) and (7.141), the left-hand side is equal to $F(u_S) + [y, y] = \min_{u \in S}\{F(u) + [y, y]\}$, which, again by (7.139), equals the right-hand side of (7.148).

The approximation property (7.148) gives rise to the following error estimate.


Theorem 7.4.4. There holds
$$\|y - u_S\|_\infty \le \sqrt{\bar{c}/\underline{c}}\;\|y' - u'\|_\infty, \quad \text{all } u \in S, \tag{7.149}$$
where $\bar{c}$ and $\underline{c}$ are the constants defined in (7.134). In particular,
$$\|y - u_S\|_\infty \le \sqrt{\bar{c}/\underline{c}}\;\inf_{u \in S}\|y' - u'\|_\infty. \tag{7.150}$$


Proof. By (7.133) and (7.148), we have
$$\underline{c}\,\|y - u_S\|_\infty^2 \le [y - u_S, y - u_S] \le [y - u, y - u] \le \bar{c}\,\|y' - u'\|_\infty^2,$$
from which (7.149) follows. □




The point of Theorem 7.4.4 is that, in order to get a good error bound, we have to 
use an approximation process y ~ u, u € S, which approximates the first derivative 
of y as well as possible. Note that this approximation process is independent of the 
one yielding us; its sole purpose is to provide a good error bound for us. 


Example. Let $\Delta$ be a subdivision of $[a,b]$, say,
$$\Delta: \quad a = x_1 < x_2 < x_3 < \cdots < x_{n-1} < x_n = b, \tag{7.151}$$
and take (see Chap. 2, Sect. 2.3.4 for notation)
$$S = \{s \in S_3^2(\Delta) : s(a) = s(b) = 0\}. \tag{7.152}$$
Here $S$ is a subspace not only of $V_0$, but even of $C_0^2[a,b]$. Its dimension is easily seen to be $n$. Given the solution $y$ of (7.127), there is a unique $s_{\mathrm{compl}} \in S$ such that
$$\begin{aligned}
&s_{\mathrm{compl}}(x_i) = y(x_i), \quad i = 1, 2, \ldots, n, \\
&s'_{\mathrm{compl}}(a) = y'(a), \quad s'_{\mathrm{compl}}(b) = y'(b),
\end{aligned} \tag{7.153}$$
the “complete cubic spline interpolant” to $y$ (cf. Chap. 2, Sect. 2.3.4(b.1)). From Chap. 2, (2.147), we know that
$$\|s'_{\mathrm{compl}} - y'\|_\infty \le \tfrac{1}{24}\,|\Delta|^3\,\|y^{(4)}\|_\infty \quad \text{if } y \in C^4[a,b].$$
Combining this with the result of Theorem 7.4.4 (in which $u = s_{\mathrm{compl}}$), we get the error bound
$$\|y - u_S\|_\infty \le \tfrac{1}{24}\sqrt{\bar{c}/\underline{c}}\;|\Delta|^3\,\|y^{(4)}\|_\infty = O(|\Delta|^3), \tag{7.154}$$
which is one order of magnitude better than the one for the ordinary finite difference method (cf. (7.97)). However, there is more work involved in computing the stiffness matrix (many integrals!), and also in solving the linear system (7.145). Even with a basis of $S$ that has small support (extending over at most four consecutive subintervals of $\Delta$), one still has to deal with a banded matrix $\boldsymbol{U}$ having bandwidth 7 (not 3, as in (7.95)).


7.5 Notes to Chapter 7 


Background material on the theory of boundary value problems can be found 
in most textbooks on ordinary differential equations. Specialized texts are Bailey 
et al. [1968], Bernfeld and Lakshmikantham [1974], and Agarwal [1986]; all three, 
but especially the first, also contain topics on the numerical solution of boundary 
value problems and applications. An early book strictly devoted to numerical 
methods for solving boundary value problems is Fox [1990], an eminently practical
account still noteworthy for its consistent use of “difference correction,” that is, the 
incorporation of remainder terms into the solution process. Subsequent books and 
monographs are Keller [1992], Na [1979], and Ascher et al. [1995]. The books by 
Keller and Na complement each other in that the former gives a mathematically 
rigorous treatment of the major methods in use, with some applications provided in 
the final chapter, and the latter a more informal presentation intended for engineers, 
and containing a wealth of engineering applications. Na’s book also discusses
methods less familiar to numerical analysts, such as the метод прогонки
developed by Russian scientists, here translated as the “method of chasing,” and
interesting methods based on transformation of variables. The book by Ascher et al.
is currently the major reference work in this area. One of its special features is an 
extensive discussion of the numerical condition and associated condition numbers 
for boundary value problems. In its first chapter, it also contains a large sampling of 
boundary value problems occurring in real-life applications. 

Sturm–Liouville eigenvalue problems, both regular and singular, and their
numerical solution are given a thorough treatment in Pryce [1993]. A set of 60 test
problems is included in one of the appendices, and references to available software 
in another. 


Section 7.1.1. The third Example is from Bailey et al. [1968, Chap. 1, Sect. 4]. 


Section 7.1.2. The exposition in this section, in particular, the proof of 
Theorem 7.1.2, follows Keller [1992, Sect. 1.2]. 


Section 7.1.3. An example of an existence and uniqueness theorem for the general 
boundary value problem (7.38) is Theorem 1.2.6 in Keller [1992]. The remark at the 
end of this section can be generalized to “partially separated” boundary conditions, 
which give rise to a “reduced” superposition method; see Ascher et al. [1995, 
Sect. 4.2.4]. 


Section 7.2. In this section, we give only a bare outline and some of the key 
ideas involved in shooting methods. To make shooting a viable method, even for 
linear boundary value problems, requires attention to many practical and technical 
details. For these, we must refer to the relevant literature, for example, Roberts and 
Shipman [1972] or, especially, Ascher et al. [1995, Chap. 4]. The latter reference
also contains two computer codes, one for linear, the other for nonlinear (nonstiff) 
boundary value problems. 

Shooting basically consists of solving a finite-dimensional system of equations 
generated by solutions of initial value problems, which is then solved iteratively, for 
example, by Newton’s method. Alternatively, one could apply Newton’s method, or 
more precisely, the Newton–Kantorovich method, directly to the boundary value
problem in question, considered as an operator equation in a Banach space of 
smooth functions on [a, b] satisfying homogeneous boundary conditions. This is the 
method of quasilinearization originally proposed by Bellman and Kalaba [1965]. 

Yet another approach is “invariant imbedding,” where the endpoint b of the 
interval [a,b] is made a variable with respect to which one differentiates to obtain 
an auxiliary nonlinear initial value problem of the Riccati type. See, for example,
Ascher et al. [1995, Sect. 4.5]. A different view of invariant imbedding based on a 
system of first-order partial differential equations and associated characteristics is 
developed in Meyer [1973]. 


Section 7.2.1. Instead of Newton’s method (7.45) one could, of course, use other
iterative methods for solving the equation $\phi(s) = 0$ in (7.25), for example, a fixed
point iteration based on the equivalent equation $s = s - m\phi(s)$, $m \ne 0$, which is
analyzed in Keller [1992, Sect. 2.2], or Steffensen’s method (cf. Chap. 4, Sect. 4.6,
(4.66)).

For the origin of the second Example in this section, see Troesch [1976]. The 
analytic solution of the associated initial value problem is given, for example, in 
Stoer and Bulirsch [2002, pp. 514-516]; also cf. Ex. 8(c),(d). 


Section 7.2.2. It is relatively straightforward to analyze the effect, in superposition 
methods, of the errors committed in the numerical integration of the initial 
value problems involved; see, for example, Keller [1992, Sect. 2.1] and Ascher
et al. [1995, Sect. 4.2.2]. More important, and potentially more disastrous, are the 
effects of rounding errors; see, for example, Ascher et al. [1995, Sect. 4.2.3]. 

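In place of (7.52), other iterative methods could be used to solve the equation
$\boldsymbol{\phi}(\boldsymbol{s}) = \boldsymbol{0}$ in (7.41), for example, one of the quasi-Newton methods (cf. Chap. 4,
Notes to Sect. 4.9.2), or, as in the scalar case, a fixed point iteration based on
$\boldsymbol{s} = \boldsymbol{s} - \boldsymbol{M}\boldsymbol{\phi}(\boldsymbol{s})$, with $\boldsymbol{M}$ a nonsingular matrix chosen such that the map
$\boldsymbol{s} \mapsto \boldsymbol{s} - \boldsymbol{M}\boldsymbol{\phi}(\boldsymbol{s})$ is contractive.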

The Example in this section is from Morrison et al. [1962], where the term 
“shooting” appears to have been used for the first time in the context of boundary 
value problems. 


Section 7.2.3. Parallel shooting is important also for linear boundary value prob- 
lems of the type (7.38), (7.42), since without it, numerical linear dependencies may 
be developing that could render the method of simple shooting useless. There are 
various versions of parallel shooting, some involving reorthogonalization of solution 
vectors; see Ascher et al. [1995, Sects. 4.3 and 4.4]. For a discussion of homotopy 
methods, including numerical examples, see Roberts and Shipman [1972, Chap. 7]. 


Sections 7.3.1 and 7.3.2. The treatment in these sections closely follows 
Keller [1992, Sects. 3.1 and 3.2]. Maintaining second-order accuracy on
nonuniform grids is not entirely straightforward; see, for example, Ascher 
et al. [1995, Sect. 5.6.1]. 

Extensions of the method of finite differences to linear and nonlinear systems 
of the type (7.38) can be based on the local use of the trapezoidal or midpoint 
rule. This is discussed in Keller [1992, Sect. 3.3] and Ascher et al. [1995, Sects. 5.1 
and 5.2]. Local use of implicit Runge–Kutta methods affords more accuracy, and so
do the methods of extrapolation and “deferred corrections”; for these, see Ascher 
et al. [1995, Sects. 5.3, 5.4, 5.5.2 and 5.5.3]. 

Boundary value problems for single higher-order differential equations are often 
solved by collocation methods using spline functions; a discussion of this is given 
in Ascher et al. [1995, Sects. 5.6.2—5.6.4]. 

Section 7.4. The treatment of variational methods in this section is along the lines
of Stoer and Bulirsch [2002, Sect. 7.5]. There are many related methods, collectively 
called projection methods, whose application to two-point boundary value problems 
and convergence analysis is the subject of a survey by Reddien [1980] containing 
extensive references to the literature. 


Exercises and Machine Assignments to Chapter 7 


Exercises 


1. Consider the nonlinear boundary value problem (Blasius equation)
$$y''' + \tfrac{1}{2}\,y\,y'' = 0, \quad 0 \le x < \infty,$$
$$y(0) = y'(0) = 0, \quad y'(\infty) = 1.$$

(a) Letting $y''(0) = \lambda$ and $z(t) = \lambda^{-1/3}\,y(\lambda^{-1/3}t)$ (assuming $\lambda > 0$), derive an
initial value problem for $z$ on $0 \le t < \infty$.

(b) Explain, and illustrate numerically and graphically, how the solution of the 
initial value problem in (a) can be used to obtain the solution y(x) of the 
given boundary value problem. 


2. The boundary value problem
$$y'' = -\frac{1}{x}\,y\,y', \quad 0 < x \le 1; \quad y(0) = 0, \quad y(1) = 1,$$
although it has a singularity at $x = 0$ and certainly does not satisfy (7.112), has
the smooth solution $y(x) = 2x/(1 + x)$.

(a) Determine analytically the $s$-interval for which the initial value problem
$$u'' = -\frac{1}{x}\,u\,u', \quad 0 < x \le 1; \quad u(0) = 0, \quad u'(0) = s$$
has a smooth solution $u(x; s)$ on $0 \le x \le 1$.
(b) Determine the $s$-interval for which Newton’s method applied to
$u(1; s) - 1 = 0$ converges.


3. Use Matlab to reproduce the results in Table 7.2 and to prepare plots of the four 
solution components. 
4. Derive (7.56) and (7.62). 
5. Let
$$\phi(s) = \frac{s}{s\cosh 1 - \sinh 1} - e$$
and $s^0$ be the solution of
$$s^0 - \frac{\phi(s^0)}{\phi'(s^0)} = \tanh 1$$
(cf. the Example in Sect. 7.2.2, in particular (7.58), (7.62)).
(a) Show that $s^0 < \coth 1$. {Hint: consider what $s^0 < t^0$ means in terms of one
Newton step at $t^0$ for the equation $\phi(s) = 0$.}

(b) Use the bisection method to compute $s^0$ to six decimal places. What are
appropriate initial approximations?


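6. Generalizing the Example in Sect. 7.2.2, let $\nu > 0$ and consider the boundary
value problem
$$\frac{dy_1}{dx} = \frac{y_1^{\nu+1}}{y_2^{\nu}}, \quad \frac{dy_2}{dx} = \frac{y_2^{\nu+1}}{y_1^{\nu}}, \quad 0 \le x \le 1,$$
$$y_1(0) = 1, \quad y_1(1) = e.$$

(a) Determine the exact solution.
(b) Solve the initial value problem
$$\frac{d}{dx}\begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} u_1^{\nu+1}/u_2^{\nu} \\ u_2^{\nu+1}/u_1^{\nu} \end{bmatrix}, \quad 0 \le x \le 1; \quad \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}(0) = \begin{bmatrix} 1 \\ s \end{bmatrix}$$
in closed form.
(c) Find the conditions on $s > 0$ guaranteeing that $u_1(x)$, $u_2(x)$ both remain
positive and finite on $[0, 1]$. In particular, show that, as $\nu \to \infty$, the interval
in which $s$ must lie shrinks to the point $s = 1$. What happens when $\nu \to 0$?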


7. Suppose the Example in Sect. 7.2.2 is modified by multiplying the right-hand 
sides of the differential equation by A, and by replacing the second boundary 
condition by y,(1) = e~*, where A > 0 is a large parameter. 


(a) What is the exact solution? 
(b) What are the conditions on s for the associated initial value problem to have 
positive and bounded solutions? What happens as A — 00? As A > 0? 


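8. The Jacobian elliptic functions sn and cn are defined by
$$\operatorname{sn}(u|k) = \sin\varphi, \quad \operatorname{cn}(u|k) = \cos\varphi, \quad 0 < k < 1,$$
where $\varphi$ is uniquely determined by
$$u = \int_0^{\varphi} \frac{d\theta}{(1 - k^2\sin^2\theta)^{1/2}}.$$
(a) Show that $K(k) := \int_0^{\pi/2} (1 - k^2\sin^2\theta)^{-1/2}\,d\theta$ is the smallest positive zero
of cn.
(b) Show that
$$\frac{d}{du}\operatorname{sn}(u|k) = \operatorname{cn}(u|k)\sqrt{1 - k^2[\operatorname{sn}(u|k)]^2}, \quad
\frac{d}{du}\operatorname{cn}(u|k) = -\operatorname{sn}(u|k)\sqrt{1 - k^2[\operatorname{sn}(u|k)]^2}.$$
(c) Show that the initial value problem
$$y'' = \lambda\sinh(\lambda y), \quad y(0) = 0, \quad y'(0) = s \quad (|s| < 2)$$
has the exact solution
$$y(x; s) = \frac{2}{\lambda}\sinh^{-1}\left(\frac{s}{2}\,\frac{\operatorname{sn}(\lambda x|k)}{\operatorname{cn}(\lambda x|k)}\right), \quad k^2 = 1 - \frac{s^2}{4}.$$
Hence show that $y(x; s)$, $x > 0$, becomes singular for the first time when
$x = x_\infty$, where
$$x_\infty = \frac{K(k)}{\lambda}.$$
(d) From the known expansion (see Radon [1950], p. 76 and Sect. 7, (I,))
$$K(k) = \ln\frac{4}{\sqrt{1 - k^2}} + \frac{1}{4}\left(\ln\frac{4}{\sqrt{1 - k^2}} - 1\right)(1 - k^2) + \cdots, \quad k \to 1,$$
conclude that
$$x_\infty \sim \frac{1}{\lambda}\ln\frac{8}{|s|} \quad \text{as } s \to 0.$$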


9. It has been shown in the first Example of Sect. 7.2.1 that the boundary value
problem
$$y'' + e^{-y} = 0, \quad 0 \le x \le 1, \quad y(0) = y(1) = 0$$
has a unique solution that is nonnegative on $[0, 1]$.

(a) Set up a finite difference method for solving the problem numerically. (Use
a uniform grid $x_n = \frac{n}{N+1}$, $n = 0, 1, \ldots, N + 1$, and the simplest of finite
difference approximations to $y''$.)

(b) Write the equations for the approximate vector $u^T = [u_1, u_2, \ldots, u_N]$ in
fixed point form $u = g(u)$ and find a compact domain $D \subset \mathbb{R}^N$ such that
$g : \mathbb{R}^N \to \mathbb{R}^N$ maps $D$ into $D$ and is contractive in $D$. {Hint: use the fact
that the tridiagonal matrix $A = \operatorname{tri}[1, -2, 1]$ has a nonpositive inverse $A^{-1}$
satisfying $\|A^{-1}\|_\infty \le \frac{1}{8}(N + 1)^2$.}

(c) Discuss the convergence of the fixed point iteration applied to the system
of finite difference equations.

10. Given the grid $\{x_n\}_{n=0}^{N+1}$ of (7.79), construct finite difference approximations $\Theta_n$
for $\theta(x_n)$, where $\theta(x) = \frac{1}{12}[y^{(4)}(x) - 2p(x)y'''(x)]$, such that $\Theta_n - \theta(x_n) =
O(h^2)$ (cf. (7.109)). {Hint: distinguish the cases $2 \le n \le N - 1$ and $n = 1$
resp. $n = N$.}

11. Prove Theorem 7.3.4.

12. Describe the application of Newton’s method for solving the nonlinear finite
difference equations $\mathcal{K}_h u = 0$ of (7.117).


13. Let $\Delta$ be the subdivision
$$a = x_0 < x_1 < \cdots < x_n < x_{n+1} = b$$
and $S = \{s \in S_1^0(\Delta) : s(a) = s(b) = 0\}$.

(a) With $[\cdot\,,\cdot]$ the inner product defined in (7.130), find an expression for
$[u_\nu, u_\mu]$ in terms of the basis of hat functions (cf. Chap. 2, Ex. 72, but note
the difference in notation) and in terms of the integrals involved; do this in
a similar manner for $\rho_\nu = (r, u_\nu)$, where $(\cdot\,,\cdot)$ is the inner product defined
in (7.128).

(b) Suppose that each integral is split into a sum of integrals over each
subinterval of $\Delta$ and the trapezoidal rule is employed to approximate the
values of the integrals. Obtain the resulting approximations for the stiffness
matrix $\boldsymbol{U}$ and the load vector $\boldsymbol{\rho}$. Interpret the linear system (7.145) thus
obtained as a finite difference method.


14. Apply the approximate variational method of Sect. 7.4.3 to the boundary value
problem
$$-y'' = r(x), \quad 0 \le x \le 1; \quad y(0) = y(1) = 0,$$
using for $S$ a space of continuous piecewise quadratic functions. Specifically,
take a uniform subdivision
$$\Delta: \quad 0 = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = 1, \quad x_\nu = \nu h,$$
of $[0, 1]$ into $n$ subintervals of length $h = 1/n$ and let $S = \{s \in S_2^0(\Delta) : s(0) =
s(1) = 0\}$.

(a) How many basis functions is $S$ expected to have? Explain.

(b) Construct a basis for $S$. {Hint: for $\nu = 1, 2, \ldots, n$ take $u_\nu = A_{\nu-1}$ to be
the quadratic function on $[x_{\nu-1}, x_\nu]$ having values $u_\nu(x_{\nu-1}) = u_\nu(x_\nu) = 0$,
$u_\nu(x_{\nu-1/2}) = 1$ and define $A_{\nu-1}$ to be zero outside of $[x_{\nu-1}, x_\nu]$. Add to
these functions the basis of hat functions $B_\nu$ for $S_1^0(\Delta)$.}

(c) Compute the stiffness matrix $\boldsymbol{U}$ (in (7.142)) for the basis constructed in (b).

(d) Interpret the resulting system $\boldsymbol{U}\boldsymbol{\xi} = \boldsymbol{\rho}$ as a finite difference method
applied to the given boundary value problem. What are the meanings of
the components of $\boldsymbol{\xi}$?
15. (a) Show that the solution $u_S$ of (7.141) is the orthogonal projection of the
exact solution $y$ of (7.138) onto the space $S$ relative to the inner product
$[\cdot\,,\cdot]$; that is,
$$[y - u_S, v] = 0 \quad \text{for all } v \in S.$$

(b) With $\|u\|_E$ denoting the energy norm of $u$ (i.e., $\|u\|_E^2 = [u, u]$), show that
$$\|y - u_S\|_E^2 = \|y\|_E^2 - \|u_S\|_E^2.$$

16. Consider the boundary value problem (7.127) and (7.124). Define the energy
norm by $\|u\|_E^2 = [u, u]$. Let $\Delta_1$ and $\Delta_2$ be two subdivisions of $[a,b]$ and $S_i =
\{s \in S_m^k(\Delta_i),\ s(a) = s(b) = 0\}$, $i = 1, 2$, for some integers $m$, $k$ with
$0 \le k < m$.

(a) With $y$ denoting the exact solution of the boundary value problem, and $\Delta_1$
being a refinement of $\Delta_2$, show that
$$\|y - u_{S_1}\|_E \le \|y - u_{S_2}\|_E.$$

(b) Let $\Delta_2$ be an arbitrary subdivision of $[a,b]$ with all grid points (including
the endpoints) being rational numbers. Prove that there exists a uniform
subdivision $\Delta_1$ of $[a,b]$, with $|\Delta_1| = h$ sufficiently small, such that
$$\|y - u_{S_1}\|_E \le \|y - u_{S_2}\|_E,$$
where $S_i$ are as defined at the beginning of the exercise.


17. Apply the variational method to the boundary value problem
$$\mathcal{L}y := -p\,y'' + q\,y = r(x), \quad 0 \le x \le 1; \quad y(0) = y(1) = 0,$$
where $p$ and $q$ are constants with $p > 0$, $q \ge 0$. Use approximants from the
space $S = \operatorname{span}\{u_\nu(x) = \sin(\nu\pi x),\ \nu = 1, 2, \ldots, n\}$, and interpret $u_S$. Find
an explicit form for $u_S$ in the case of constant $r$.


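18. Let $y$ be the exact solution of the boundary value problem (7.123)–(7.125) and
$u_S$ the approximate solution of the associated extremal problem with $S = \{s \in
S_1^0(\Delta) : s(a) = s(b) = 0\}$ and $\Delta: a = x_0 < x_1 < \cdots < x_n < x_{n+1} = b$.
Prove that
$$\|y - u_S\|_\infty \le \frac{1}{2}\sqrt{\frac{\bar{c}}{\underline{c}}}\;\max_{0 \le \nu \le n}\operatorname{osc}_{[x_\nu, x_{\nu+1}]}(y'),$$
where $\operatorname{osc}_{[c,d]}(f) := \max_{[c,d]} f - \min_{[c,d]} f$ and $\bar{c}$, $\underline{c}$ are the constants defined
in (7.134). In particular, show that
$$\|y - u_S\|_\infty \le \frac{1}{2}\sqrt{\frac{\bar{c}}{\underline{c}}}\;|\Delta|\,\|y''\|_\infty.$$
{Hint: apply Theorem 7.4.4, in particular, (7.150).}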
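19. Consider the boundary value problem (7.127) and (7.124) with $p(x)$ and $q(x)$
being positive constants,
$$p(x) = p > 0, \quad q(x) = q > 0.$$
Let $S = \{s \in S_1^0(\Delta) : s(a) = s(b) = 0\}$, where the subdivision $\Delta: a = x_0 <
x_1 < x_2 < \cdots < x_n < x_{n+1} = b$ is assumed to be quasi uniform; that is,
$$\Delta x_\nu := x_{\nu+1} - x_\nu \ge \beta|\Delta|, \quad \nu = 0, 1, \ldots, n,$$
for some positive constant $\beta$. (Recall that $|\Delta| := \max_{0 \le \nu \le n}\Delta x_\nu$.) Let $\boldsymbol{U}$ be
the stiffness matrix (cf. (7.142)) for the basis $u_\nu = B_\nu$, $\nu = 1, 2, \ldots, n$, of
hat functions (cf. Chap. 2, Ex. 72, but note the difference in notation). Write
$u(x) = \sum_{\nu=1}^n \xi_\nu u_\nu(x)$ for any $u \in S$, and $\boldsymbol{\xi}^T = [\xi_1, \xi_2, \ldots, \xi_n]$.

(a) Show that $\boldsymbol{\xi}^T\boldsymbol{U}\boldsymbol{\xi} = [u, u]$.

(b) Show that $\|u'\|_{L_2}^2 = \boldsymbol{\xi}^T\boldsymbol{T}_1\boldsymbol{\xi}$, where $\boldsymbol{T}_1$ is a symmetric tridiagonal matrix
with
$$(\boldsymbol{T}_1)_{\nu,\nu} = \frac{1}{\Delta x_{\nu-1}} + \frac{1}{\Delta x_\nu}, \quad \nu = 1, 2, \ldots, n;$$
$$(\boldsymbol{T}_1)_{\nu+1,\nu} = (\boldsymbol{T}_1)_{\nu,\nu+1} = -\frac{1}{\Delta x_\nu}, \quad \nu = 1, \ldots, n - 1.$$
{Hint: use integration by parts, being careful to observe that $u'$ is only
piecewise continuous.}

(c) Show that $\|u\|_{L_2}^2 = \boldsymbol{\xi}^T\boldsymbol{T}_0\boldsymbol{\xi}$, where $\boldsymbol{T}_0$ is a symmetric tridiagonal matrix
with
$$(\boldsymbol{T}_0)_{\nu,\nu} = \tfrac{1}{3}(\Delta x_{\nu-1} + \Delta x_\nu), \quad \nu = 1, 2, \ldots, n;$$
$$(\boldsymbol{T}_0)_{\nu+1,\nu} = (\boldsymbol{T}_0)_{\nu,\nu+1} = \tfrac{1}{6}\Delta x_\nu, \quad \nu = 1, \ldots, n - 1.$$

(d) Combine (a)–(c) to compute $[u, u]$ and hence to estimate the Euclidean
condition number $\operatorname{cond}_2 \boldsymbol{U}$. {Hint: use Gershgorin’s theorem to estimate
the eigenvalues of $\boldsymbol{U}$.}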

(e) The analysis in (d) fails if g = 0. Show, however, in the case of a uniform 
grid, that when g = 0 then cond) U < 1/ sin? he 

(f) Indicate how the argument in (d) can be extended to variable p(x), g(x) 


satisfying 0 < p(x) < p, 0<q < q(x) <q on[a,)]. 


20. The method of collocation for solving a boundary value problem

$$Ly = r(x), \quad 0 \le x \le 1; \quad y(0) = y(1) = 0,$$

consists of selecting an $n$-dimensional subspace $S \subset V_0$ and determining $u_S \in S$ such that $(Lu_S)(x_\mu) = r(x_\mu)$ for a discrete set of points $0 < x_1 < x_2 < \cdots < x_n < 1$. Apply this method to the problem of Ex. 17, with $S$ as defined there. Discuss the solvability of the system of linear equations involved in the method. {Hint: use the known fact that the only trigonometric sine polynomial $\sum_{\nu=1}^n \xi_\nu \sin(\nu\pi x)$ of degree $n$ that vanishes at $n$ distinct points in $(0, 1)$ is the one identically zero.}


Machine Assignments 


1. The following eigenvalue problem arises in the physics of gas discharges. Determine the smallest positive $\lambda > 0$ such that

$$\varphi'' + \frac{1}{r}\varphi' + \lambda^2 \varphi(1 - \varphi) = 0, \quad 0 < r \le 1,$$
$$\varphi(0) = a, \quad \varphi'(0) = 0, \quad \varphi(1) = 0,$$

where $a$ is given, $0 < a < 1$.

(a) Explain why $\lambda = 0$ cannot be an eigenvalue.

(b) Reduce the problem to an initial value problem. {Hint: make a change of variables, $x = \lambda r$, $y(x) = \varphi(x/\lambda)$.}

(c) Use Maple to determine the Taylor expansion up to the power $x^8$ of the solution $y(x, a)$ to the initial value problem of (b).

(d) Integrate the initial value problem starting at $x = .1$, using the Taylor expansion of (c) to determine the initial data $y(.1, a)$ and $\frac{dy}{dx}(.1, a)$. Use the classical Runge–Kutta method (for example, the Matlab routine of Chap. 5, MA 1(a)) and integrate until the solution $y$ becomes negative. Then apply interpolation to compute an approximation to $\lambda$, the solution of $y(\lambda, a) = 0$, to an accuracy of about five decimal digits. Prepare a table of the $\lambda$ so obtained for $a = .1 : .1 : .9$, including the values of the integration step $h$ required.

(e) For $a = .1 : .1 : .9$ use Matlab to produce graphs of the solutions $y(x, a)$ on intervals from $x = 0$ to $x = \lambda$, the zero of $y$. (Determine the endpoints of these intervals from the results of (d).) Use the Matlab routine ode45 to do the integration from $.1$ to $\lambda$ and connect the points $(0, a)$ and $(.1, y(.1, a))$ by a straight line segment. (Compute $y(.1, a)$ by the Taylor expansion of (c).)


2. The shape of an ideal flexible chain of length $L$, hung from two points $(0, 0)$ and $(1, 1)$, is determined by the solution of the eigenvalue problem

$$y'' = \lambda\sqrt{1 + (y')^2}, \quad y(0) = 0, \quad y(1) = 1, \quad \int_0^1 \sqrt{1 + (y')^2}\,dx = L.$$

Strictly speaking, this is not a problem of the form (7.3), (7.4), but nevertheless can be solved analytically as well as numerically.

(a) On physical grounds, what condition must $L$ satisfy for the problem to have a solution?

(b) Derive three equations in three unknowns: the two constants of integration and the eigenvalue $\lambda$. Obtain a transcendental equation for $\lambda$ by eliminating the other two unknowns. Solve the equation numerically and thus find, and plot, the solution for $L = 2, 4, 8$, and $16$.

(c) If one approximately solves the problem by a finite difference method over a uniform grid $x_0 = 0 < x_1 < x_2 < \cdots < x_N < x_{N+1} = 1$, $x_n = \frac{n}{N+1}$, approximating the integral in the third boundary condition by the composite trapezoidal rule, a system of $N + 1$ nonlinear equations in $N + 1$ unknowns results. Solve the system by a homotopy method, using $L$ as the homotopy parameter. Since for $L = \sqrt{2}$ the solution is trivial, select a sequence of parameter values $L_0 = \sqrt{2} < L_1 < \cdots < L_m$ and solve the finite difference equations for $L_i$ using the solution for $L_{i-1}$ as the initial approximation. Implement this for the values of $L$ given in (b), taking a sequence $\{L_i\}$ which contains these values. Compare the numerical results for the eigenvalues with the analytic ones for $N = 10, 20, 40$. (Use the routine fsolve from the Matlab optimization toolbox to solve the system of nonlinear equations.)

3. Change the boundary value problem of the first Example of Sect. 7.2.1 to

$$y'' = -e^y, \quad 0 \le x \le 1, \quad y(0) = y(1) = 0.$$

Then Theorem 7.1.2 no longer applies (why not?). In fact, it is known that the problem has two solutions. Use Matlab to compute the respective initial slopes $y'(0)$ to 12 significant digits by Newton's method, as indicated in the text. {Hint: use approximations $s^{(0)} = 1$ and $s^{(0)} = 15$ to the initial slopes.}

4. Consider the boundary value problem

(BVP) $\quad y'' = y^2, \quad 0 \le x \le b; \quad y(0) = 0, \quad y(b) = \beta,$

and the associated initial value problem

(IVP) $\quad u'' = u^2, \quad u(0) = 0, \quad u'(0) = s.$

Denote the solution of (IVP) by $u(x) = u(x; s)$.

(a) Let $v(x) = u(x; -1)$. Show that

$$v'(x) = -\sqrt{\tfrac{2}{3}v^3(x) + 1},$$

and thus the function $v$, being convex (i.e., $v'' \ge 0$), has a minimum at some $x_0 > 0$ with value $v_{\min} = -(3/2)^{1/3} = -1.1447142\ldots$. Show that $v$ is symmetric with respect to the line $x = x_0$.

(b) Compute $x_0$ numerically in terms of the beta integral. {Point of information: the beta integral is $B(p, q) = \int_0^1 t^{p-1}(1 - t)^{q-1}\,dt$ and has the value $\frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)}$.}

(c) Use Matlab to compute $v$ by solving the initial value problem

$$v'' = v^2, \quad x_0 \le x \le 3x_0; \quad v(x_0) = v_{\min}, \quad v'(x_0) = 0.$$

Plot the solution (and its symmetric part) on $-x_0 \le x \le 3x_0$.

(d) In terms of the function $v$ defined in (a), show that

$$u(x; -s^3) = s^2 v(sx), \quad \text{all } s \in \mathbb{R}.$$

As $s$ ranges over all the reals, the solution manifold $\{s^2 v(sx)\}$ thus encompasses all the solutions of (IVP). Prepare a plot of this solution manifold. Note that there exists an envelope of the manifold, located in the lower half-plane. Explain why, in principle, this envelope must be the solution of a first-order differential equation.

(e) Based on the plot obtained in (d), discuss the number of possible solutions to the original boundary value problem (BVP). In particular, determine for what values of $b$ and $\beta$ there does not exist any solution.

(f) Use the method of finite differences on a uniform grid to compute the two solutions of (BVP) for $b = 3x_0$, $\beta = v_0$, where $v_0 = v(3x_0)$ is a quantity already computed in (c). Solve the systems of nonlinear difference equations by Newton's method. In trying to get the first solution, approximate the solution $v$ of (a) on $0 \le x \le 3x_0$ by a quadratic function $\tilde v$ satisfying $\tilde v(0) = 0$, $\tilde v'(0) = -1$, $\tilde v(3x_0) = v_0$, and then use its restriction to the grid as the initial approximation to Newton's method. For the second solution, try the initial approximation obtained from the linear approximation $\hat v(x) = v_0 x/(3x_0)$. In both cases, plot initial approximations as well as the solutions to the difference equations. What happens if Newton's method is replaced by the method of successive approximations?


5. The following boundary value problem occurs in soil engineering. Determine $y(r)$, $1 \le r < \infty$, such that

$$\frac{1}{r}\frac{d}{dr}\left(r y \frac{dy}{dr}\right) + \rho(1 - y) = 0, \quad y(1) = \eta, \quad y(\infty) = 1,$$

where $\rho$, $\eta$ are parameters satisfying $\rho > 0$, $0 < \eta < 1$. The quantity of interest is $\sigma = \left.\frac{dy}{dr}\right|_{r=1}$.

(a) Let $z(x) = [y(e^x)]^2$. Derive the boundary value problem and the quantity of interest in terms of $z$.

(b) Consider the initial value problem associated with the boundary value problem in (a), having initial conditions

$$z(0) = \eta^2, \quad z'(0) = s.$$

Discuss the qualitative behavior of its solutions for real values of $s$. {Hint: suggested questions may be: admissible domain in the $(x, z)$-plane, convexity and concavity of the solutions, the role of the line $z = 1$.}

(c) From your analysis in (b), devise an appropriate shooting procedure for solving the boundary value problem numerically and for computing the quantity $\sigma$. Run your procedure on the computer for various values of $\eta$ and $\rho$. In particular, prepare a five-decimal table showing the values of $s$ and $\sigma$ for $\eta = 0.1(0.1).9$ and $\rho = 0.5, 1, 2, 5, 10$, and plot $\sigma$ versus $\eta$.

1. (a)


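We have $z'(t) = \lambda^{-2/3} y'(\lambda^{-1/3}t)$, $z''(t) = \lambda^{-1} y''(\lambda^{-1/3}t)$, $z'''(t) = \lambda^{-4/3} y'''(\lambda^{-1/3}t)$. Put $x = \lambda^{-1/3}t$ in the given boundary value problem to obtain $\lambda^{4/3}z''' + \frac{1}{2}\lambda^{4/3}zz'' = 0$, that is,

$$z''' + \tfrac{1}{2}zz'' = 0, \quad 0 \le t < \infty.$$

The initial conditions for $z$ follow from those for $y$ and the definition of $\lambda$:

$$z(0) = z'(0) = 0, \quad z''(0) = 1.$$

(b) The boundary condition at $\infty$ for $y'$ transforms to $z'(\infty) = \lambda^{-2/3}$, so that $\lambda = [z'(\infty)]^{-3/2}$. Thus, solving the initial value problem of (a) on $[0, \infty)$, we obtain $z'(\infty)$, hence $\lambda$, hence $y(x)$ as the solution of the initial value problem

$$y''' + \tfrac{1}{2}yy'' = 0,$$
$$y(0) = y'(0) = 0, \quad y''(0) = \lambda.$$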


To explore the convergence of z(t) as tf —> oo, we run the small Matlab 
program 


%EXVII_1B1
%
f0='%8.2f %12.8f\n';
disp('       t       zprime')
z0=[0;0;1]; tspan=[0 5 10 15];
options=odeset('AbsTol',.5e-8);
[t,z]=ode45(@fEXVII_1,tspan,z0,options);
for i=1:4
  fprintf(f0,t(i,1),z(i,2))
end


%fEXVII_1
%
function yprime=fEXVII_1(x,y)
yprime=[y(2);y(3);-y(1)*y(3)/2];


The results 


>> EXVII_1B1
       t       zprime
    0.00   0.00000000
    5.00   2.08544470
   10.00   2.08553204
   15.00   2.08553204
>>


show that, for graphical purposes, $z'(\infty) = 2.0855$ is an acceptable value for the limit. Thus, $\lambda = [z'(\infty)]^{-3/2} = .33204$. We can now solve the initial value problem of interest. The program below plots the three components of the solution $y(x) = [y(x), y'(x), y''(x)]$ on the interval $[0, 5]$.


%EXVII_1B2
%
global lambda
lambda=.33204;
y0=[0;0;lambda]; xspan=[0 5];
[x,y]=ode45(@fEXVII_1,xspan,y0);
plot(x,y)


PLOT (figure omitted: the three solution components on [0, 5])
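
The limit $z'(\infty)$ and the resulting $\lambda = [z'(\infty)]^{-3/2}$ can be cross-checked independently of ode45. The following is a minimal sketch in Python (standard library only), assuming a classical fixed-step Runge–Kutta integrator with step $h = 0.01$ up to $t = 15$; the function names `blasius_rhs`, `rk4_step`, and `zprime_at` are illustrative and not from the text.

```python
def blasius_rhs(z):
    """z''' + z*z''/2 = 0 as a first-order system in (z, z', z'')."""
    return [z[1], z[2], -0.5 * z[0] * z[2]]


def rk4_step(f, z, h):
    """One classical fourth-order Runge-Kutta step for an autonomous system."""
    k1 = f(z)
    k2 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k1)])
    k3 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k2)])
    k4 = f([zi + h * ki for zi, ki in zip(z, k3)])
    return [zi + h * (a + 2 * b + 2 * c + d) / 6
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]


def zprime_at(t_end, h=0.01):
    """Integrate from z(0)=z'(0)=0, z''(0)=1 and return z'(t_end)."""
    z = [0.0, 0.0, 1.0]
    for _ in range(round(t_end / h)):
        z = rk4_step(blasius_rhs, z, h)
    return z[1]


zp_inf = zprime_at(15.0)      # z' has settled long before t = 15
lam = zp_inf ** (-1.5)        # lambda = [z'(inf)]^(-3/2)
print(zp_inf, lam)
```

To the four to five digits quoted above the two computations should agree; small differences in later digits are to be expected, since ode45 is run here with its default relative tolerance.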
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14. (a) The space S9(A) has degree of freedom 3n — (n — 1) = 2n + 1, since there 


are $n$ quadratics having three degrees of freedom each, and $n - 1$ continuity requirements at the $n - 1$ interior grid points. Since the boundary conditions reduce the degree of freedom by two, $S$ has $2n - 1$ degrees of freedom, and we expect this to be the number of basis functions in $S$.

(b) Following the Hint, let $A_{\nu-1}(x)$, $\nu = 1, 2, \ldots, n$, be supported on $[x_{\nu-1}, x_\nu]$ and a quadratic function there, vanishing at the endpoints of the support interval and having the value 1 at the midpoint. Let $B_\nu(x)$, $\nu = 1, 2, \ldots, n - 1$, be the interior hat functions of those forming a basis of $S_1^0(\Delta)$ (cf. Chap. 2, Sect. 2.3.2, but note the difference in notation). We claim that

$$u_\nu(x) = A_{\nu-1}(x), \ \nu = 1, \ldots, n; \quad u_{n+\nu}(x) = B_\nu(x), \ \nu = 1, \ldots, n - 1,$$

is a basis of $S$ (it has the correct number of functions). To prove this, we must show that $\operatorname{span}(u_1, u_2, \ldots, u_{2n-1}) = S$ and that the $u_\nu$ are linearly independent. Let

$$u(x) = \sum_{\nu=1}^{n} c_\nu A_{\nu-1}(x) + \sum_{\nu=1}^{n-1} c_{n+\nu} B_\nu(x).$$

It is clear that $u \in S$ (note that $A_0(0) = B_1(0) = 0$, $A_{n-1}(1) = B_{n-1}(1) = 0$), so that $\operatorname{span}(u_1, \ldots, u_{2n-1}) \subset S$. Conversely, let $s \in S$ be an arbitrary member of $S$. Then it can be represented in the form above, i.e., $S \subset \operatorname{span}(u_1, \ldots, u_{2n-1})$. Indeed, note that for $\nu = 1, 2, \ldots, n$; $\mu = 0, 1, \ldots, n$,

$$A_{\nu-1}(x_\mu) = 0, \quad A_{\nu-1}(x_{\mu+1/2}) = \delta_{\nu,\mu+1};$$
$$B_\nu(x_\mu) = \delta_{\nu,\mu}, \quad B_\nu(x_{\mu+1/2}) = \tfrac{1}{2}(\delta_{\nu,\mu} + \delta_{\nu,\mu+1}).$$

Thus, putting $x = x_\mu$, we find $s(x_\mu) = c_{n+\mu}$ for $\mu = 1, 2, \ldots, n - 1$, and putting $x = x_{\mu+1/2}$, we get $s(x_{\mu+1/2}) = c_{\mu+1} + \frac{1}{2}(c_{n+\mu} + c_{n+\mu+1})$ for $\mu = 0, 1, \ldots, n - 1$ (where the terms $c_{n+\mu}$ for $\mu = 0$ and $c_{2n}$ for $\mu = n - 1$ are to be read as zero, there being no hat functions $B_0$ and $B_n$ in the basis). The first set of equations determines $c_{n+1}, c_{n+2}, \ldots, c_{2n-1}$, and the second set (written in reverse order) determines $c_n, c_{n-1}, \ldots, c_1$. This proves $\operatorname{span}(u_1, \ldots, u_{2n-1}) = S$. The linear independence follows likewise (put $s(x) \equiv 0$ in the argument above).
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(c) Straightforward interpolation gives 


Avi(s) = FO- a) [1-FO-5,p)] om bots) 


1 aaah | 
whereas 
X—Xy-1 
—, on [xy-1, Xv], 
By(x) = ; v=1,2,....n—1. 
Xy+1—X 
——— on [xy, x41], 
h 
Hence, 


A A) = Slav —1)h—2x] on [x,-1, x,], 


1 
— on [x)-1, Xv], 
Bi(x) = 


“F on [x,, Xy41]. 


It is clear that the leading n x n diagonal block of U is a diagonal matrix, 
since i Aj) (x)Ai,_1(x)dx = Oif v A wand p = 1,g = 0 in (7.130). 
Its diagonal elements are 


1 Xp 
/ [A ~o/de= ie alr — 1h —2xPdx 


16 [v — 1h — 2x9? |” 16 
= Sy PH 12s; 
h4 3+ (—2) 3h 


Xy—1 


The n x (n — 1) block of U consisting of the first n rows and last n — 1 
columns is the zero matrix (and hence also the block symmetric to it). This 
is so because 


1 
/ Al _,(x) Bi (x)dx =0 if|u—v| > 1, 
0 


on the next page the integrand being identically zero, and since by 
symmetry (see the figure for v = 2 and h = 1/4), 
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= 


0.8 


0.6 


0.4 


0.2 


1 1 
/ Aj (x) By_, (x)dx = / A’ (x) Bi (x)dx 
0 0 


[ “Oe = thea 
= sz l(2v — lh — 2x|—dx 
ty h 
_ 4 [Qv—Dh-2xP |" 
h3 De) 


= 0. 


Xy-1 


Finally, the last (n — 1) x (m — 1) diagonal block is tridiagonal, since 


1 
i By(x) Bi (x)dx =0 if |u—v| > 1, 
0 
and 
1 9 1 1 
i [Bijfde=—, / Bi (x) Bi (x)dx = ——. 
0 h 0 h 
Thus, 


hil oT 3 
(d) The first 7 components of the load vector p (cf. (7.142)) are 


1 1 
i ar ar pS FS 
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1 Xy 
Pv =| ra)dya(ade fo reer) 1 - (x - +) dx. 
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With the change of variable x = x,—, + th, this becomes 


1 
—— ah [ r(xy-1 + th)t( —t)dt, v=1,2,...,n. 
0 


The remaining components are 
1 
— 
ae / 1 (2) By (x)dx = ik re) ag ce ro) tt ar, 
0 Xy-1 xXy 


which, by the change of variable x = x, + th, becomes 


0 I 
Pntv =h if r(xy + th)( + t)dt +f r(x) + thy) = nar} 
-1 0 
V= 12 .02...n— 1, 
They can readily be interpreted in terms of weighted averages of r over 


one or two consecutive subintervals. Indeed, if we introduce the weight 
functions 


1l+?r, —1l<t<0O, 


wo(t) =t(1—t), 0<t <1; m=) 0<t<i1 


then, since is wo(t)dt = i and i w (t)dt = 1, we can write 


1 
/ r(xy—1 + th)wo(t)dt 


= hh 7, = 22 , V=1,2,...,n, 


1 
/ wo(t)dt 
0 


and 


1 
ji r(x, + th)w,(t)dt 


Pn4+v =hi,, y= i 1 : v=1,2,...,n—-1. 
/ wi (t)dt 
-1 


We can now interpret the system of linear equations U& = p as follows. 
First note that in 


n—-1 


u(x) = eA v— 13) + D Bry Br (x) 


v=! 
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we have from (b) that 


u(Xp) = En tpes b= 12a ght = 1, 


and 


1 
u(x,44) = End F 5 (En+n - Entuti) 


1 
= Eva t+ urn) + u(Xp41)), @ =90,1,...,02—-1. 


Thus, writing uv, = u(x), the meaning of the -components is 


By Taylor’s formula centered at x,—1/2, 


2 
& = Wy) + O(h’). 


The first n equations of the system U&é = p thus are 


16 


2 
a7 SN = xh ry, 
pe 


that is, 
—u"(x,_1) =F,+ O0(h’), v=1,2,...,n 


The remaining equations are the standard finite difference equations 
—(uy—1 — 2uy + uy41) = WF, + O(n), v =1,2,..., jo= 1, 


where up = Uy, = 0. 
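19. (a) We have

$$[u, u] = \Big[\sum_{\nu=1}^n \xi_\nu u_\nu, \sum_{\mu=1}^n \xi_\mu u_\mu\Big] = \sum_{\nu,\mu=1}^n [u_\nu, u_\mu]\,\xi_\nu \xi_\mu = \xi^T U \xi.$$

(b) We have

$$\|u'\|_{L_2}^2 = \int_a^b [u'(x)]^2\,dx = \sum_{\nu=0}^n \int_{x_\nu}^{x_{\nu+1}} [u'(x)]^2\,dx.$$

To each integral in the summation on the right, we apply integration by parts:

$$\int_{x_\nu}^{x_{\nu+1}} u'(x)u'(x)\,dx = u'(x)u(x)\Big|_{x_\nu}^{x_{\nu+1}} - \int_{x_\nu}^{x_{\nu+1}} u''(x)u(x)\,dx = u'(x_{\nu+1} - 0)u(x_{\nu+1}) - u'(x_\nu + 0)u(x_\nu),$$

since $u''(x) = 0$ on $(x_\nu, x_{\nu+1})$. On the interval $[x_0, x_1]$ one has $u(x) = \xi_1 B_1(x)$, hence

$$u(x_0) = 0, \quad u(x_1) = \xi_1; \quad u'(x_0 + 0) = u'(x_1 - 0) = \frac{\xi_1}{\Delta x_0}.$$

On the interval $[x_\nu, x_{\nu+1}]$ $(1 \le \nu \le n - 1)$, one has $u(x) = \xi_\nu B_\nu(x) + \xi_{\nu+1} B_{\nu+1}(x)$, hence

$$u(x_\nu) = \xi_\nu, \quad u(x_{\nu+1}) = \xi_{\nu+1}; \quad u'(x_\nu + 0) = u'(x_{\nu+1} - 0) = \frac{1}{\Delta x_\nu}(\xi_{\nu+1} - \xi_\nu).$$

Finally, on $[x_n, x_{n+1}]$, one has $u(x) = \xi_n B_n(x)$, hence

$$u(x_n) = \xi_n, \quad u(x_{n+1}) = 0; \quad u'(x_n + 0) = u'(x_{n+1} - 0) = -\frac{\xi_n}{\Delta x_n}.$$

There follows

$$\|u'\|_{L_2}^2 = \frac{1}{\Delta x_0}\xi_1^2 + \sum_{\nu=1}^{n-1} \frac{1}{\Delta x_\nu}(\xi_{\nu+1} - \xi_\nu)^2 + \frac{1}{\Delta x_n}\xi_n^2$$
$$= \frac{1}{\Delta x_0}\xi_1^2 + \sum_{\nu=1}^{n-1} \frac{1}{\Delta x_\nu}(\xi_{\nu+1}^2 - 2\xi_\nu\xi_{\nu+1} + \xi_\nu^2) + \frac{1}{\Delta x_n}\xi_n^2$$
$$= \sum_{\nu=1}^{n} \Big(\frac{1}{\Delta x_{\nu-1}} + \frac{1}{\Delta x_\nu}\Big)\xi_\nu^2 - 2\sum_{\nu=1}^{n-1} \frac{1}{\Delta x_\nu}\xi_\nu\xi_{\nu+1} = \xi^T T_1 \xi.$$

(c) We have

$$\|u\|_{L_2}^2 = \int_a^b \sum_{\nu=1}^n \xi_\nu B_\nu(x) \sum_{\mu=1}^n \xi_\mu B_\mu(x)\,dx = \sum_{\nu,\mu=1}^n \xi_\nu \xi_\mu \int_a^b B_\nu(x)B_\mu(x)\,dx.$$

Since $\int_a^b B_\nu(x)B_\mu(x)\,dx = 0$ if $|\mu - \nu| > 1$, the matrix of this quadratic form is tridiagonal, and clearly symmetric. One computes

$$\int_a^b B_\nu^2(x)\,dx = \int_{x_{\nu-1}}^{x_\nu} \Big(\frac{x - x_{\nu-1}}{\Delta x_{\nu-1}}\Big)^2 dx + \int_{x_\nu}^{x_{\nu+1}} \Big(\frac{x_{\nu+1} - x}{\Delta x_\nu}\Big)^2 dx$$
$$= \Delta x_{\nu-1}\int_0^1 t^2\,dt + \Delta x_\nu \int_0^1 (1 - t)^2\,dt = \tfrac{1}{3}(\Delta x_{\nu-1} + \Delta x_\nu), \quad \nu = 1, 2, \ldots, n,$$

and

$$\int_a^b B_\nu(x)B_{\nu+1}(x)\,dx = \int_{x_\nu}^{x_{\nu+1}} \frac{x_{\nu+1} - x}{\Delta x_\nu}\cdot\frac{x - x_\nu}{\Delta x_\nu}\,dx = \Delta x_\nu \int_0^1 (1 - t)t\,dt = \tfrac{1}{6}\Delta x_\nu, \quad \nu = 1, 2, \ldots, n - 1.$$

There follows

$$\|u\|_{L_2}^2 = \xi^T T_0 \xi,$$

as claimed.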
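(d) We have by (7.130) and the results in (b) and (c),

$$[u, u] = p\|u'\|_{L_2}^2 + q\|u\|_{L_2}^2 = \xi^T(pT_1 + qT_0)\xi.$$

Comparison with (a) shows that $U = pT_1 + qT_0$. The Gershgorin disks of the tridiagonal matrix $U$ have centers at

$$p\Big(\frac{1}{\Delta x_{\nu-1}} + \frac{1}{\Delta x_\nu}\Big) + \tfrac{1}{3}q(\Delta x_{\nu-1} + \Delta x_\nu), \quad \nu = 1, 2, \ldots, n,$$

with respective radii bounded by

$$\frac{p}{\Delta x_1} + \tfrac{1}{6}q\Delta x_1;$$
$$p\Big(\frac{1}{\Delta x_{\nu-1}} + \frac{1}{\Delta x_\nu}\Big) + \tfrac{1}{6}q(\Delta x_{\nu-1} + \Delta x_\nu), \quad \nu = 2, \ldots, n - 1;$$
$$\frac{p}{\Delta x_{n-1}} + \tfrac{1}{6}q\Delta x_{n-1}.$$

Since by Gershgorin's theorem all eigenvalues $\lambda_\nu$ of $U$ are contained in the union of these disks, and are real, we have

$$\lambda_{\max} \le \max_{1\le\nu\le n}\Big\{2p\Big(\frac{1}{\Delta x_{\nu-1}} + \frac{1}{\Delta x_\nu}\Big) + \tfrac{1}{2}q(\Delta x_{\nu-1} + \Delta x_\nu)\Big\}.$$

By the quasi uniformity of the grid, we have

$$\frac{1}{\Delta x_\nu} \le \frac{1}{\beta|\Delta|}, \quad \nu = 0, 1, \ldots, n,$$

and therefore

$$\lambda_{\max} \le \frac{4p}{\beta|\Delta|} + q|\Delta|.$$

Similarly,

$$\lambda_{\min} \ge \min\Big\{\frac{p}{\Delta x_0} + \tfrac{1}{6}q(2\Delta x_0 + \Delta x_1);\ \ \tfrac{1}{6}q(\Delta x_{\nu-1} + \Delta x_\nu),\ \nu = 2, \ldots, n - 1;\ \ \frac{p}{\Delta x_n} + \tfrac{1}{6}q(\Delta x_{n-1} + 2\Delta x_n)\Big\}.$$

By the quasi uniformity of the grid and $p > 0$, we get

$$\lambda_{\min} \ge \tfrac{1}{3}q\beta|\Delta|.$$

There follows

$$\operatorname{cond}_2 U = \frac{\lambda_{\max}}{\lambda_{\min}} \le \frac{12p}{q\beta^2|\Delta|^2} + \frac{3}{\beta}.$$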
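
The Gershgorin estimates of (d) are easy to check numerically on a concrete grid. The sketch below (Python, standard library only) assembles the diagonal and off-diagonal entries of $U = pT_1 + qT_0$ for a randomly generated grid on $[0,1]$ and compares the Gershgorin interval with the bounds $\frac{4p}{\beta|\Delta|} + q|\Delta|$ and $\frac{1}{3}q\beta|\Delta|$. The grid, the values of $p$ and $q$, and all function names are assumptions made for illustration; $\beta$ is taken as the largest admissible value $\min_\nu \Delta x_\nu / |\Delta|$.

```python
import random


def stiffness_entries(dx, p, q):
    """Diagonal and off-diagonal of U = p*T1 + q*T0; dx[v] = x_{v+1} - x_v, v = 0..n."""
    n = len(dx) - 1
    diag = [p * (1 / dx[v - 1] + 1 / dx[v]) + q * (dx[v - 1] + dx[v]) / 3
            for v in range(1, n + 1)]
    # off[v-1] couples the unknowns v and v+1, v = 1..n-1
    off = [-p / dx[v] + q * dx[v] / 6 for v in range(1, n)]
    return diag, off


def gershgorin_interval(diag, off):
    """Smallest real interval containing all Gershgorin disks of the tridiagonal matrix."""
    n = len(diag)
    lo, hi = float("inf"), float("-inf")
    for i in range(n):
        r = (abs(off[i - 1]) if i > 0 else 0.0) + (abs(off[i]) if i < n - 1 else 0.0)
        lo = min(lo, diag[i] - r)
        hi = max(hi, diag[i] + r)
    return lo, hi


random.seed(0)
n, p, q = 20, 1.0, 2.0
dx = [random.uniform(0.5, 1.0) for _ in range(n + 1)]
total = sum(dx)
dx = [d / total for d in dx]           # grid on [0, 1]
h = max(dx)                            # |Delta|
beta = min(dx) / h                     # quasi-uniformity constant

diag, off = stiffness_entries(dx, p, q)
lo, hi = gershgorin_interval(diag, off)
lam_max_bound = 4 * p / (beta * h) + q * h
lam_min_bound = q * beta * h / 3
cond_bound = 12 * p / (q * beta**2 * h**2) + 3 / beta
print(lo, hi, lam_min_bound, lam_max_bound, cond_bound)
```

Since all eigenvalues lie in the Gershgorin interval `[lo, hi]`, the ratio `hi/lo` is itself a computable bound for the condition number whenever `lo > 0`, and it cannot exceed `cond_bound`.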
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(e) 


(f) 


It is clear from (d) that in the case g = 0 one has U = pT;, and fora 


: a . _ . _ bea das 
uniform grid with |A| = h = 7+, by the definition of 7}, 


[u, u] = gre rs, U=2F 
1 


=('s 


where T = tri(—1, 2,—1) is a tridiagonal matrix of order 1 with elements 
2 on the diagonal and —1 on the two side diagonals. Thus, cond) U = 
cond, T. To find the eigenvalues of 7, note that the characteristic polyno- 
mial z,(A) = det(T — AT) satisfies 


T+ (A) = (2—A) my (A) — mR (A), Kk =0,1,...,n-—1, 
m(A) = 0, mo(A) = 1. 


Hence, in terms of the Chebyshev polynomial T,,, one has (cf. Chap. 2, 
Sect. 2.2.4, (2.83)) 


! 1 
(A) = Th (1 = 51) ,n>2; m(A)= 27, (1 _ 54) : 


2v—-1 
2n 


For the eigenvalues 1, we therefore have 1 — 5Ay = COs , thus, 


It follows that 


2n — 
Anax = 4 sin? i 7 = dcos? & , Amin = 4sin = ‘ 
2n 2 4n 4n 


and 


4 
cond, U = cot” —. 
4n 


Je. 


In particular, cond, U < 1/ sin? ae 
In (7.130) use the mean value theorem of integration to write 


[u.u] = plu’ Ii7, + a(n) llaullz,- 


Hence, p and q, throughout the argument in (d), can be replaced by p(&) 
and q(7), respectively. The result is 


12p 3 
cond, U < ——_4-. 
qB’|A??— B 
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1. (a) IfA = 0, then 


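$$\varphi'' + \frac{1}{r}\varphi' = 0,$$

giving

$$\varphi(r) = c_1 \ln r + c_2$$

with constants $c_1$, $c_2$. The first condition $\varphi(0) = a$ requires $c_1 = 0$, $c_2 = a$, hence $\varphi(r) \equiv a$. This contradicts the third condition, $\varphi(1) = 0$, since $a > 0$.

(b) With the suggested change of variables $x = \lambda r$, $y(x) = \varphi(x/\lambda)$, we have

$$\frac{dy}{dx} = \frac{1}{\lambda}\varphi'(x/\lambda), \quad \frac{d^2y}{dx^2} = \frac{1}{\lambda^2}\varphi''(x/\lambda).$$

Putting $r = x/\lambda$ in the boundary value problem, we get

$$\lambda^2\frac{d^2y}{dx^2} + \frac{\lambda^2}{x}\frac{dy}{dx} + \lambda^2 y(1 - y) = 0,$$
$$y(0) = a, \quad \frac{dy}{dx}(0) = 0, \quad y(\lambda) = 0,$$

that is,

$$\frac{d^2y}{dx^2} + \frac{1}{x}\frac{dy}{dx} + y(1 - y) = 0,$$
$$y(0) = a, \quad \frac{dy}{dx}(0) = 0.$$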


Thus, we need to integrate this initial value problem until the solution y 
vanishes for the first time. The corresponding value of x is the desired 
eigenvalue. 


(c)

PROGRAM (Maple)

eq:=diff(y(x),x,x)+(1/x)*diff(y(x),x)+y(x)*(1-y(x))=0;
ini:=y(0)=a, D(y)(0)=0;
Order:=10;
sol:=dsolve({eq,ini},{y(x)},type=series);
p:=convert(sol,polynom);

The results produced for the coefficients $c_0, c_1, c_2, \ldots$ in $y(x, a) = c_0 + c_1 x + c_2 x^2 + \cdots$ are

$$c_1 = c_3 = c_5 = c_7 = 0,$$
$$c_0 = a,$$
$$c_2 = -\tfrac{1}{4}\,a(1 - a),$$
$$c_4 = \tfrac{1}{64}\,a(1 - a)(1 - 2a),$$
$$c_6 = -\tfrac{1}{2304}\,a(1 - a)(1 - 8a + 8a^2),$$
$$c_8 = \tfrac{1}{147456}\,a(1 - a)(1 - 2a)(1 - 26a + 26a^2).$$

(d) We use cubic interpolation (which has the same order $O(h^4)$ of error as the Runge–Kutta method) and add one more integration step after the solution has become negative to make the interpolation problem more symmetric.


PROGRAMS

%MAVII_1D
%
f0='%6.2f %9.5f %11.4e\n';
disp('     a    lambda       h')
y=zeros(4,1);
for a=.1:.1:.9
  h=.01; lam1=0; errlam=1;
  while errlam>.5e-5
    h=h/2; lam0=lam1; x=.1;
    y(2)=yMAVII_1(x-2*h,a);
    y(3)=yMAVII_1(x-h,a);
    [y(4),y1]=yMAVII_1(x,a);
    u1=[y(4);y1]; u=1;
    while u>0
      u0=u1; y(1:3)=y(2:4);
      u1=RK4(@fMAVII_1,x,u0,h);
      y(4)=u1(1); u=y(4); x=x+h;
    end
    y(1:3)=y(2:4);
    u2=RK4(@fMAVII_1,x,u1,h);
    % cubic through y(1..4) at the local nodes t=0,1,2,3
    y(4)=u2(1); p=[(-y(1)+3*y(2) ...
      -3*y(3)+y(4))/6 (2*y(1)-5*y(2) ...
      +4*y(3)-y(4))/2 (-11*y(1) ...
      +18*y(2)-9*y(3)+2*y(4))/6 y(1)];
    r=roots(p); lam1=0;
    for k=1:3
      if 1<r(k) & r(k)<2
        % the nodes t=0..3 sit at x-2h,...,x+h, so x+(r(k)-2)*h is
        % the exact abscissa; the offset below differs from it by h
        lam1=x+(r(k)-3)*h;
      end
    end
    if lam1==0
      fprintf(['interpolant not in' ...
        ' range for a=%5.2f\n'],a)
      return
    end
    errlam=abs(lam1-lam0);
  end
  fprintf(f0,a,lam1,h)
end


%fMAVII_1
%
function yprime=fMAVII_1(x,y)
yprime=[y(2);-y(2)/x-y(1)*(1-y(1))];


%yMAVII_1
%
function [y,y1]=yMAVII_1(x,a)
y=a-(1/4)*a*(1-a)*x^2+(1/64)*a*(1-a)*(1-2*a)*x^4-(1/2304) ...
  *a*(1-a)*(1-8*a+8*a^2)*x^6+(1/147456)*a*(1-a)*(1-2*a) ...
  *(1-26*a+26*a^2)*x^8;
y1=-(1/2)*a*(1-a)*x+(1/16)*a*(1-a)*(1-2*a)*x^3-(1/384) ...
  *a*(1-a)*(1-8*a+8*a^2)*x^5+(1/18432)*a*(1-a)*(1-2*a) ...
  *(1-26*a+26*a^2)*x^7;

OUTPUT 

>> MAVII_1D 
a lambda h 
0.10 2.49725 4.8828e-06 
0.20 2.60240 4.8828e-06 
0.30 2.72378 4.8828e-06 
0.40 2.86654 4.8828e-06 
0.50 3.03864 4.8828e-06 
0.60 3.25347 4.8828e-06 
0.70 3.53610 4.8828e-06 
0.80 3.94284 4.8828e-06 
0.90 4.65326 4.8828e-06 


>> 
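
As an independent plausibility check of individual table entries, the same shooting idea can be sketched in a few lines of Python (standard library only). This is not the Matlab program above: it assumes a fixed step $h = 10^{-3}$, starts from the Taylor polynomial of (c) at $x = .1$, and locates the zero by linear rather than cubic interpolation; the function names are illustrative.

```python
def taylor_start(x, a):
    """Value and slope of the degree-8 Taylor polynomial of part (c)."""
    b = a * (1 - a)
    c2 = -b / 4
    c4 = b * (1 - 2 * a) / 64
    c6 = -b * (1 - 8 * a + 8 * a * a) / 2304
    c8 = b * (1 - 2 * a) * (1 - 26 * a + 26 * a * a) / 147456
    y = a + c2 * x**2 + c4 * x**4 + c6 * x**6 + c8 * x**8
    dy = 2 * c2 * x + 4 * c4 * x**3 + 6 * c6 * x**5 + 8 * c8 * x**7
    return (y, dy)


def rhs(x, u):
    """First-order system for y'' + y'/x + y(1-y) = 0."""
    return (u[1], -u[1] / x - u[0] * (1 - u[0]))


def rk4_step(x, u, h):
    """One classical Runge-Kutta step from x to x+h."""
    k1 = rhs(x, u)
    k2 = rhs(x + h / 2, tuple(ui + h / 2 * ki for ui, ki in zip(u, k1)))
    k3 = rhs(x + h / 2, tuple(ui + h / 2 * ki for ui, ki in zip(u, k2)))
    k4 = rhs(x + h, tuple(ui + h * ki for ui, ki in zip(u, k3)))
    return tuple(ui + h * (p + 2 * q + 2 * r + s) / 6
                 for ui, p, q, r, s in zip(u, k1, k2, k3, k4))


def first_zero(a, h=1e-3):
    """Integrate from x = 0.1 until y changes sign; interpolate linearly."""
    x, u = 0.1, taylor_start(0.1, a)
    while True:
        u_new = rk4_step(x, u, h)
        if u_new[0] <= 0:
            return x + h * u[0] / (u[0] - u_new[0])
        x, u = x + h, u_new


lam_01 = first_zero(0.1)
lam_05 = first_zero(0.5)
print(lam_01, lam_05)
```

The two values should agree with the entries for $a = .1$ and $a = .5$ in the table to roughly four decimals; the linear interpolation limits the attainable accuracy for this step size.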
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(e) 
PROGRAM


%MAVII_1E
%
endpoint=[2.497;2.602;2.724;2.867;
  3.039;3.253;3.536;3.943;4.653];
hold on
for ia=1:9
  a=ia*.1;
  [y,y1]=yMAVII_1(.1,a);
  u0=[y;y1]; xspan=[.1 endpoint(ia)];
  [x,u]=ode45(@fMAVII_1,xspan,u0);
  plot(x,u(:,1))
  plot([0 .1],[a y])
end
plot([0 5],[0 0])
axis([0 5 -.1 1])
hold off


PLOTS (figure omitted: the solutions y(x, a), a = .1 : .1 : .9, on 0 ≤ x ≤ 5)

4. (a) The function $v(x) = u(x; -1)$ satisfies

$$\frac{1}{2}\frac{d}{dx}(v')^2 = \frac{1}{3}\frac{d}{dx}v^3,$$

which, when integrated from 0 to $x$, and using $v(0) = 0$, $v'(0) = -1$, gives

$$(v')^2 - 1 = \tfrac{2}{3}v^3,$$

that is,

$$v'(x) = \pm\sqrt{\tfrac{2}{3}v^3(x) + 1}.$$

Since $v$ has a negative slope at $x = 0$, and $v(0) = 0$, it initially decreases and becomes negative. Therefore, in the formula above, we must take the square root with the minus sign. This proves the assertion. Let $v(x_0 \pm t) = z_\pm(t)$. Both functions $z_\pm$ satisfy the same differential equation $z'' = z^2$ with the same initial conditions $z(0) = v_{\min}$, $z'(0) = 0$. Hence, they are the same, $z_+(t) \equiv z_-(t)$, proving symmetry of $v$ with respect to the line $x = x_0$.

(b) From part (a) we have

$$\frac{dv}{dx} = -\sqrt{\tfrac{2}{3}v^3 + 1}, \quad \frac{dx}{dv} = -\frac{1}{\sqrt{\tfrac{2}{3}v^3 + 1}},$$

from which there follows, since $x = 0$ for $v = 0$,

$$x = x(v) := -\int_0^v \frac{dt}{\sqrt{\tfrac{2}{3}t^3 + 1}}.$$

Since $x_0 = x(v_{\min})$, we get

$$x_0 = -\int_0^{v_{\min}} \frac{dt}{\sqrt{\tfrac{2}{3}t^3 + 1}} = \int_0^{|v_{\min}|} \frac{dt}{\sqrt{1 - \tfrac{2}{3}t^3}}, \quad |v_{\min}| = \left(\tfrac{3}{2}\right)^{1/3}.$$

The change of variables $\tfrac{2}{3}t^3 = \tau$, $dt = (18)^{-1/3}\tau^{-2/3}\,d\tau$ yields

$$x_0 = (18)^{-1/3}\int_0^1 \tau^{-2/3}(1 - \tau)^{-1/2}\,d\tau = (18)^{-1/3}B\left(\tfrac{1}{3}, \tfrac{1}{2}\right) = \frac{\sqrt{\pi}\,\Gamma(1/3)}{(18)^{1/3}\,\Gamma(5/6)},$$

since $\Gamma(1/2) = \sqrt{\pi}$. With the Matlab command

x0=sqrt(pi)*gamma(1/3)/(18^(1/3)*gamma(5/6))

one computes

$$x_0 = 1.605097826619464.$$
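
The same closed-form value can be evaluated with any language that provides the gamma function. A minimal Python sketch (standard library `math.gamma`; the variable names are illustrative), which also checks that $v_{\min} = -(3/2)^{1/3}$ makes the radicand $\frac{2}{3}v^3 + 1$ vanish:

```python
import math

# x0 = 18^(-1/3) * B(1/3, 1/2) = sqrt(pi)*Gamma(1/3) / (18^(1/3)*Gamma(5/6))
x0 = math.sqrt(math.pi) * math.gamma(1 / 3) / (18 ** (1 / 3) * math.gamma(5 / 6))

# minimum value of v, where v' = -sqrt(2/3*v^3 + 1) vanishes
vmin = -(1.5 ** (1 / 3))
radicand_at_min = 2 / 3 * vmin**3 + 1

print(x0, vmin, radicand_at_min)
```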


(c)

PROGRAM


%MAVII_4C
%
x0=1.6050978; vmin=-(3/2)^(1/3);
vspan=[x0 3*x0]; v0=[vmin;0];
options=odeset('AbsTol',.5e-8);
[x,v]=ode45(@fMAVII_4C,vspan,v0,options);
v(size(v,1),1)
hold on
plot(x,v(:,1))
plot(2*x0-x,v(:,1))
hold off


%fMAVII_4C
%
function vprime=fMAVII_4C(x,v)
vprime=[v(2);v(1)^2];


PLOT (figure omitted: the solution v and its symmetric part on −x0 ≤ x ≤ 3x0)


(d) Let $u(x) = s^2 v(sx)$. Then

$$\frac{d^2u}{dx^2} = s^4 v''(sx) = s^4 v^2(sx) = u^2,$$

showing that $u$ satisfies the differential equation of (IVP). Furthermore,

$$u(0) = 0, \quad \frac{d}{dx}u(x)\Big|_{x=0} = s^3 v'(sx)\Big|_{x=0} = -s^3.$$

Therefore, $u(x) = u(x; -s^3)$ as claimed.
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PROGRAM 
%MAVII_4D
%
x0=1.6050978; xf=3*x0;
vspan=[x0 3*x0]; vmin=-(3/2)^(1/3);
v0=[vmin;0]; options=odeset('AbsTol',.5e-8);
[x,v]=ode45(@fMAVII_4C,vspan,v0,options);
hold on
for s=-2:.2:-.2
  sx=x0:.01:xf;
  vsx=interp1(x,v(:,1),sx,'spline');
  xs=sx/s;
  plot(xs,(s^2)*vsx);
  sx=x0:-.01:-x0;
  xs=sx/s;
  plot(xs,(s^2)*vsx);
  axis([-15 15 -6 10])
end
for s=.2:.2:2
  sx=x0:.01:xf;
  vsx=interp1(x,v(:,1),sx,'spline');
  xs=sx/s;
  plot(xs,(s^2)*vsx)
  sx=x0:-.01:-x0;
  xs=sx/s;
  plot(xs,(s^2)*vsx);
  axis([-15 15 -6 10])
end
hold off


PLOTS (figure omitted: the solution manifold on −15 ≤ x ≤ 15, −6 ≤ y ≤ 10)


If $y = e(x)$ is the equation of the envelope, and $v(x; s) := s^2 v(sx)$, then for each $x \in \mathbb{R}$ there must exist an $s$ such that

$$e(x) = v(x; s) \quad \text{and} \quad e'(x) = v'(x; s),$$

where prime means differentiation with respect to $x$. If the first equation is solved (in principle) for $s$ as a function of $x$ and $e$, and the result $s = s(x, e)$ substituted into the second equation, there results the (rather complicated) differential equation

$$e' = v'(x; s(x, e)).$$

(e) If the point $(b, \beta)$ lies in the domain above the envelope, there are exactly two solutions of (BVP), one touching the right-hand border of the envelope, the other the left-hand border. If $(b, \beta)$ lies on the envelope, there is exactly one solution, the one touching the envelope at the point $(b, \beta)$. If $(b, \beta)$ lies in the domain below the envelope, there cannot exist any solution.

(f) Given the uniform grid $\Delta: x_0 = 0 < x_1 < \cdots < x_N < x_{N+1} = 3x_0$, with $|\Delta| = h = 3x_0/(N + 1)$, the simplest difference equations are

$$u_{n+1} - 2u_n + u_{n-1} = h^2 u_n^2, \quad n = 1, 2, \ldots, N,$$
$$u_0 = 0, \quad u_{N+1} = v_0.$$

Writing $u = [u_1, u_2, \ldots, u_N]^T$ and letting $A = \operatorname{tri}(1, -2, 1)$ be the tridiagonal matrix of order $N$ with elements $-2$ on the diagonal and $1$ on the side diagonals, the system of difference equations can be written as

$$f(u) = 0, \quad f(u) := Au - h^2 u^2 + v_0 e_N,$$

where $u^2 = [u_1^2, u_2^2, \ldots, u_N^2]^T$ and $e_N = [0, 0, \ldots, 0, 1]^T$. The Jacobian of $f$ is

$$f_u(u) = A - 2h^2\operatorname{diag}(u),$$

where $\operatorname{diag}(u)$ is the diagonal matrix with diagonal elements $u_1, u_2, \ldots, u_N$. Newton's method therefore takes the form

$$u^{[i+1]} = u^{[i]} - d^{[i]}, \quad f_u(u^{[i]})\,d^{[i]} = f(u^{[i]}), \quad i = 0, 1, 2, \ldots,$$

where $u^{[0]} = [u_1^{[0]}, \ldots, u_N^{[0]}]^T$, $u_n^{[0]} = \tilde v(x_n)$ resp. $u_n^{[0]} = \hat v(x_n)$. The functions $\tilde v$, $\hat v$ determining the two initial approximations are

$$\tilde v(x) = x\left[-1 + \frac{x}{3x_0}\left(1 + \frac{v_0}{3x_0}\right)\right], \quad \hat v(x) = \frac{v_0}{3x_0}\,x.$$

The program implementing this method with $N = 50$ and accuracy tolerance $\varepsilon_0 = .5 \times 10^{-8}$, and with initial approximations as suggested, is shown below.


PROGRAM


N=50; eps0=.5e-8;
x0=1.6050978; v0=2.2893929;
h=3*x0/(N+1); h2=h^2;
u0=zeros(N,1);
A=zeros(N);
e=eye(N);
eN=e(:,N);
hold on
for i=1:2
  for n=1:N
    if (i==1)
      u0(n)=n*h*(-1+((n*h)/(3*x0)) ...
        *(1+v0/(3*x0)));
    else
      u0(n)=n*h*v0/(3*x0);
    end
    A(n,n)=-2;
    if (n<N)
      A(n,n+1)=1; A(n+1,n)=1;
    end
  end
  plot([0;u0;v0],'--')   % initial approximation, dashed (reconstructed line)
  axis([0 55 -1.5 2.5])
  plot([0 55],[0 0])
  it=0;                  % iteration counter (reconstructed line)
  u1=u0; u0=zeros(N,1);
  while (norm(u1-u0,inf)>eps0 & it<20)
    it=it+1;
    u0=u1;               % current iterate (reconstructed line)
    f=A*u0-h2*u0.^2+v0*eN;
    J=A-2*h2*diag(u0);
    d=J\f;
    u1=u0-d;
  end
  u1p=[0;u1;v0];
  plot(u1p);
  text(26,-1,'1st solution','FontSize',14)
  text(21,.6,'2nd solution','FontSize',14)
end
hold off

When applying this routine, we get rapid convergence (within five to six iterations). Plots of the answers are shown below. The dashed lines are initial approximations.


PLOTS (figure omitted: initial approximations and the two solutions of the difference equations)


The method of successive approximations, described by

$$u^{[i+1]} = A^{-1}\big(h^2 (u^{[i]})^2 - v_0 e_N\big), \quad i = 0, 1, 2, \ldots,$$

does not converge, neither for $N = 50$ nor for $N = 20$, $10$, or $5$.
properties of, 175 
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Gauss’s principle 
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formula, 208 
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history of, 199 
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Gauss—Legendre formula, 368 
Gauss—Lobatto quadrature formula, 172, 199, 
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positivity of, 205 
Gauss—Radau quadrature formula, 172, 199, 
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positivity of, 205 
Gauss-Seidel method, 291 
nonlinear, 291 
Gauss—Turan quadrature formula, 204 
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algorithm for computing, 176 
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error term of, 177 
two-point, 172 
Gaussian quadrature rules, 258 
applications of, 178 
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generating function, 411, 412 
Gershgorin’s theorem, 265, 288 
golden mean, 270 
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procedure, 64, 65, 69 
modified, 114 
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maintaining second-order 
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predetermined, 343 
produced dynamically, 343 
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uniform, 343 
grid function, 333 
infinity norm of, 345, 402 
norm of, 343 
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collection of, 343 
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active, 408 
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Halley’s method, 301 
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hat functions, 105 
heat equation, 327 
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Hermite interpolation, 74, 97, 
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error term of, 98, 177 
explicit formulae for, 116 
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elementary, 207 
Hermite interpolation problem, 97, 97, 107 
Hermite polynomials 
monic, 206 
Hessian matrix, 339 
calculation of, 195 
Heun’s method, 338 
order of, 340 
Hilbert matrix, 22, 24, 63 
condition number of, 23 
condition of, 23 
Euclidean condition number of, 42 
explicit inverse of, 30, 42 
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Horner’s scheme, 281 
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approximate, 333, 334 
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continuity of the solution with respect to 
initial values, 357, 372 
for ordinary differential equations, 325 
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solution tracks in the numerical 
solution of, 357 
standard, 328 
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discrete, 124 
for Sturm-Liouville problems, 504 
homogeneity of, 59 
linearity of, 59 
of Sobolev type, 114 
positive definiteness of, 59 
symmetry of, 59 
integral equations 
nonlinear, 256 
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kernel of, 256 
numerical solution of 
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the numerical evaluation of 
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integrating factor, 480 
integration 
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multiple, 197 
texts on, 197 
numerical, 165, 257 
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197 
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near optimality of, 117 
interpolation 
nth-degree 
points, 80 
at the roots of unity, 85 
convergence of, 85 
by piecewise linear functions, 117 
convergence of, 81, 82 
error of, 77, 78 
inverse, 100 
operator, 76 
homogeneity of, 76 
projection property of, 76 
error term of, 160 
leading coefficient of, 94, 96 
algorithms for, 115 
triangular array of, 81 
uniqueness of, 75 
interpolation problem, 57 
inverse function 
error of, 116 
efficiency index of, 261 
Jackson’s theorems, 116 
for the weight function w, 176 
importance of eigenvalues and 
eigenvectors of 
for Gauss quadrature rules, 199 
Jacobian elliptic functions, 513 
k-step method 
k-step methods, 332, 399 
of order p 
asymptotically best, 427 
L, approximation, 113 
error term of 
in the complex plane 
convergence domain for 
method 
asymptotic error constant for, 268 
METOD PROGONKI, 510 
metric, 32 
midpoint rule 
for differential equations, 404, 407 
analytic determination of the 
of order p 
in infinite-dimensional spaces, 290 
in parallel shooting, 491 
in shooting methods, 483, 485, 487 
local convergence of, 278 
Matlab program for, 276 
quadratic convergence of, 278 
limit distribution of, 116 
nonlinear functionals 
ODE, see ordinary differential equations 
one-step and multistep methods 
unified treatment of, 373 
one-step methods, 332, 343, 399, 450 
A-stability of, 362 
criterion of, 366 
applied to the model problem, 361 
asymptotic global error estimates of, 347 
334 
convergence of, 333, 344, 347, 348 
error accumulation in, 358 
exact order of, 334 
global error of, 333 
error in, 357 
monitoring the global error of, 352 
order of, 334 
principal error function of, 334, 348, 352 
residual operator R_h of, 344 
second-order two-stage, 339 
principal error function of, 340 
single step of, 333, 333 
stability criterion for, 345 
stability inequality for, 345 
stability of, 333, 344 
step control in, 352, 359 
truncation error of, 333 
variable-method codes for, 376 
variational differential equation for, 352, 
375 
operator 
linear symmetric, 504 
operator norm, 14 
optimal control problems, 325 
optimization, 195 
texts on, xxiv 
order of convergence, 259 
order star theory, 377 
on Riemann surfaces, 456 
order stars 
applications of, 377 
order term O(·), 190, 402 
ordinary differential equations 
initial value problems for, 325 
numerical solution of 
texts on, 371 
solution by Taylor expansion of, 195 
computation of, 115 
discrete, 69 
as interpolation polynomials, 73 
method of, 114 
discrete orthogonality property of, 
interlacing property of the zeros of, 207 
of Sobolev type, 114 
reality and simplicity of the zeros of, 207 
recurrence coefficients for, 70, 176 
standard text on, 115 
Sturm property of, 264 
symmetric, 71 
table of some classical, 177 
three-term recurrence relation for, 70 
orthogonal systems, 59, 60, 63, 65 
examples of, 67 
linear independence of, 61 
of functions, 60 
orthonormal polynomials 
three-term recurrence relation for, 123 
Padé approximation 
texts on, 377 
to the exponential function, 360 
explicit formulae for, 363 
properties of, 364 
Padé approximation 
to the exponential function, 362 
parallel computation, xix 
parallel computing 
partial differential equations 
numerical solution of 
texts on, xxv 
Peano kernel, 188 
definite of order r, 188 
of the functional L associated with a 
multistep method, 406 
PECE method, 413 
characteristic polynomial of, 424 
local truncation error of, 415 
estimate of, 415 
order of, 414 
principal error function of, 414 
residual operator R_h of, 424 
truncation error of, 414 
Milne estimator of, 415 
polynomial interpolation, 73 
theory of, 159 
257 
algebraic, 67 
deflated 
orthogonal, 69, 69 
trigonometric, 167 
of a matrix, 62 
Adams-type 
problem 
of apportionment, 8, 30 
two-point boundary value, 254 
product integration 
of multiple integrals, 199 
projection methods 
application of 
convergence analysis of, 512 
QR algorithm, 177 
quadrature formulae 
optimal, 197 
Quadpack, 197 
quadratic equation, 32 
solving a, 29 
quadratic form, 62 
quadratic interpolation 
error of, 79 
weighted, 165 
quadrature formula 
rational arithmetic, 5 
rational functions, 56 
abstract notion of, 28 
axiomatic approach to, 2 
development of the concept of, 28 
recurrence coefficients 
table of, 199 
reference solution, 333, 344, 357 
derivatives of, 336 
regula falsi, 289 
relative error, 6 
residual operator R, 344, 401 
residual operator R_h, 420, 429 
Rodrigues formula, 71, 72 
Rolle’s theorem, 78 
Romberg integration, 190, 195, 218 
Romberg schemes 
root condition, 416, 419-421, 424, 427, 
roots 
qualitative properties of, 254 
roots of unity, 196 
Rouché’s theorem, 365 
derivation of, 377 
symmetric, 6, 7 
roundoff errors, 1 
statistical theory of, 453 
order 2r, 368 
A-stability of, 368 
Runge-Kutta formulae 
regions of absolute stability for, 378 
embedded, 356 
4(5) and 7(8) pairs of, 376 
implicit, 367 
pairs of, 356 
Runge-Kutta method, 341 
r-stage, 341 
Butcher array for, 374 
quadrature order of, 374 
stage order of, 374 
algebraically stable, 376 
contemporary work and history of the, 
375 
maximum attainable order of, 342 
of orders twelve and fourteen, 373 
efficient implementation of, 377 
implicit r-stage, 341 
semi-implicit, 375 
of the independent variable, 334 
Schrödinger’s equation, 325 
Schwarz’s inequality, 59, 505 
scientific computation, xix 
scientific computing 
secant method, 269, 277 
efficiency index of, 273 
extension to systems of equations, 289 
equations, 287 
history of, 289 
local convergence of, 270 
Matlab program for, 273 
order of convergence of, 272 
truncated, 332 
shooting, 483, 485, 510 
an example for the, 486 
single-method schemes, 344 
A-stable, 377 


Slatec, xxi 
numerical approximation 
and software for, 113 
theory of, 325 
spline functions 
of degree m 
spline interpolant 
convergence of natural, 118 
error of, 118 
error bounds of, 118 
complete 
minimum norm property of, 118 
concept of, 454 
relative, 377 
text on, 376 
problem, 333 
of differential equations, 360 
phenomenon of, 361 
stiffness matrix, 507, 509 
Stirling’s formula, 34, 129 
stopping rule, 261 